ABSTRACT

A speech analysis system based on a combination of physiological and
psychoacoustic  results  has  been  developed.  The  system  contains  a 
nonuniform  Filter/Detector  bank.  A  new  relationship  between 
Filter/Detectors  and  the  Short-Time  Fourier  Transform  magnitude  is 
derived,  and  a  generalized  version  of  the  Short-Time  Fourier  Transform 
magnitude  is  used  to  implement  the  analysis  system.  The  new  relationship 
is  also  applied  to  a  discussion  of  channel  vocoders,  spectrograms,  the 
sliding  Discrete  Fourier  Transform,  average  power  spectrum  estimation, 
and  nonuniform  bandwidth  analysis.  Next,  a  new  synthesis  approach  is 
used  to  reconstruct  signals  from  the  magnitude  data  produced  by  the 
nonuniform  analysis.  Apart  from  an  overall  sign  factor,  the 
analysis/synthesis  system  achieves  exact  reconstruction  in  the  absence  of 
data  modification.  The  ability  of  the  system  to  reconstruct  signals  from 
modified data is also demonstrated. Suggestions for further research,
including data reduction and Automatic Speech Recognition applications,
are given.

CHAPTER 1

INTRODUCTION 


1.1  MOTIVATION 

Information  exchange  between  human  beings  often  takes  place  in  the 
form  of  audio  communication,  or  talking  and  listening.  This  form  of 
communication is convenient and provides a rapid means of information
transfer.  Audio  communication  between  humans  and  computers  is  also 
useful.  Computer  voice  synthesis  can  replace  warning  lights  and  other 
displays,  and  Automatic  Speech  Recognition  (ASR)  devices  can  act  as 
keyboard replacements.

Computer speech input/output has several advantages over other forms
of man-to-machine communication. Since audio communication devices
occupy minimal physical volume, they can be used where large displays and
keyboards are unacceptable. Speech allows "hands-off" communication of
data  as  required  for  parcel  sorting  or  wheelchair  control.  In  addition, 
speech  can  provide  convenient  access  to  computer  information  via  the 
telephone. 

Humans  have  a  speech  recognition  ability  which  is  superior  to  that 
of  existing  ASR  machines.  Disregarding  effects  such  as  visual  cues  and 
contextual  information,  humans  make  speech  recognition  judgements  based 
on  information  from  the  auditory  system.  Therefore,  if  results  from 
perceptual  and  physiological  studies  of  the  auditory  system  are  applied, 
it  may  be  possible  to  design  improved  ASR  machines. 

When  applying  auditory  system  results  to  the  design  of  ASR  machines 
it  is  useful  to  understand  what  information,  if  any,  is  lost  in  the  first 
analysis  stage  (or  "front-end")  of  the  system.  Inappropriate  front-end 
information  loss  can  degrade  overall  ASR  system  performance.  For 
example,  if  a  poorly  designed  front-end  produces  the  same  output  in 
response  to  two  perceptually  different  input  words,  subsequent  processing 
stages  must  rely  on  contextual  information  rather  than  the  analyzed 
acoustic  waveform  to  make  a  correct  identification.  It  may  therefore  be 
possible  to  improve  system  performance  if  such  information  loss  is 
avoided.

The problem of front-end information loss can be discovered when a
synthesis technique is used to test the analyzed speech data for suitable
information content. Furthermore, analyzed data may be subjected to a
variety of transformations in order to reduce the data rate or
investigate various auditory processes. Effects of such transformations
may  be  examined  by  application  of  appropriate  inverse  transformations  and 
signal  synthesis  from  the  processed  data. 


1.2  HISTORICAL  DEVELOPMENT  OF  THE  PROBLEM 

The  Sound  Spectrograph  is  a  widely  used  tool  for  creating  speech 
spectrum  displays,  or  spectrograms.  A  number  of  researchers  have  devised 
machines for reconstructing speech from spectrograms (Flanagan [1]),
thereby  creating  a  speech  analysis/synthesis  system.  The  intelligible 
monotone  speech  produced  by  such  machines  has  been  used  in  extensive 
perceptual  studies.  The  Sound  Spectrograph  itself  provides  an  audio 
analysis  which  is  uniform  with  respect  to  frequency,  and  thus  does  not 
model  human  perception.  Development  of  an  auditory  spectrogram-like 
representation  is  a  current  research  goal  (Carlson  and  Granstrom  [2]). 


In a related area, spectrogram-like representations can be generated
from the Short-Time Fourier Transform (STFT) magnitude (Rabiner and
Schafer [3]). Since signals can be reconstructed from the STFT magnitude
(Altes [4]; Nawab, Quatieri, and Lim [5], [6]), a speech
analysis/synthesis system can be developed using STFT techniques. As
with spectrograms, however, this approach provides an analysis which is
uniform with respect to frequency. The STFT can be modified for
nonuniform analysis (Gambardella [7]; Youngberg and Boll [8]), but the
corresponding synthesis techniques reported in the literature require
both magnitude and phase of the modified STFT to perform signal
reconstruction. Since available magnitude-only reconstruction techniques
(Nawab, Quatieri, and Lim [9]) use autocorrelation functions rather than
performing reconstruction directly from spectral values, such techniques
cannot be modified for nonuniform analysis/synthesis. Furthermore,
available approaches do not generally achieve exact signal reconstruction
in the absence of data modification (Griffin and Lim [10]). Exact
reconstruction is a desirable feature for algorithmic verification.


1.3  THE  SCOPE  OF  THIS  REPORT 


This  report  presents  a  speech  analysis/synthesis  system  based  on 
perception.  Physiological  and  psychoacoustic  results  suggest  that  a 
nonuniform  bank  of  Filter/Detector  (F/D)  subsystems  can  be  used  in  the 
speech  analysis  system,  as  shown  in  Chapter  2.  A  new  relationship 
between  F/D  subsystems  and  the  STFT  magnitude  (or,  equivalently,  the  STFT 
magnitude  squared)  is  described,  and  a  generalized  version  of  the  STFT 
magnitude  is  used  to  implement  the  desired  F/D  bank.  A  new  synthesis 
approach  capable  of  reconstructing  signals  from  the  generalized  STFT 
magnitude  is  described  in  Chapter  3.  Examples  of  results  produced  by  the 
analysis/synthesis  system  are  presented  in  Chapter  4.  Apart  from  an 
overall  sign  factor,  the  system  achieves  exact  reconstruction  in  the 
absence  of  data  modification.  The  ability  of  the  system  to  reconstruct 
signals  from  modified  data  is  also  demonstrated.  A  summary  is  given  in 
Chapter  5,  along  with  suggestions  for  further  research.  Appendix  A 
presents  standard  definitions  for  reference  purposes.  Prerequisite  F/D 
theory,  which  is  used  throughout  the  report,  is  presented  in  Appendix  B. 
Several  approaches  to  computation  of  the  generalized  STFT  magnitude  are 
described  in  Appendix  C.  Appendix  D  applies  the  new  relationship  between 
F/D  subsystems  and  the  STFT  magnitude  to  a  discussion  of  channel 
vocoders,  spectrograms,  the  sliding  Discrete  Fourier  Transform,  average 
power spectrum estimation, and nonuniform bandwidth analysis.

CHAPTER 2


SPEECH  ANALYSIS  SYSTEM 


2.1  INTRODUCTION 


In this chapter, a simplified model of the (monaural) human
peripheral auditory system is developed from a combination of
physiological and psychoacoustic data. Binaural effects will not be
discussed, although such effects may be important for Automatic Speech
Recognition applications in noisy environments (Lyon [11]). A generalized
version of the Short-Time Fourier Transform magnitude squared is used to
obtain a digital implementation of the model. [...] also
computed, and minimum sampling rate issues are discussed.


2.2  A  SIMPLIFIED  AUDITORY  MODEL  BASED  ON  PHYSIOLOGICAL  RESULTS 


Fig.  2.1  is  a  diagram  of  the  human  peripheral  auditory  system 
showing the outer, middle, and inner ear structures (Flanagan [1]). The
drawing  is  not  to  scale,  and  some  structures  are  enlarged  for 
illustrative  purposes.  In  the  auditory  system,  sounds  entering  the  outer 
ear  travel  through  the  middle  ear  and  generate  pressures  in  the  inner  ear 
fluids.  The  cochlea,  a  structure  in  the  inner  ear,  contains  the  basilar 
membrane  which  functions  as  a  filter  bank.  Basilar  membrane  motion 
causes  hair  cells  in  the  organ  of  Corti  to  produce  firings  on  the 
auditory  nerve,  which  contains  approximately  30,000  fibers.  A  number  of 
researchers  have  studied  firing  patterns  by  inserting  microelectrodes 
into  the  auditory  nerve  fibers  of  anesthetized  animals  (Kiang  [12]; 
Frishkopf  [13];  Katsuki,  Suga,  and  Kanno  [14]).  Such  studies  indicate 
that  the  peripheral  auditory  system  can  be  roughly  modeled  as  a 
Filter/Detector  (F/D)  bank,  and  model  parameters  can  be  derived  from 
physiological  data. 

Fig.  2.2  presents  a  F/D  subsystem  of  the  type  often  used  in  auditory 
models (e.g., see Siebert [15]). The input is analogous to pressure at
the  eardrum,  and  the  output  simulates  various  firing  pattern  features 
which  will  be  described  later  in  this  section.  For  simplicity,  the 
effects  of  spontaneous  nerve  firing  activity  have  been  omitted  from  the 
model.  The  F/D  subsystem  of  Fig.  2.2  consists  of  a  bandpass  filter, 
memoryless  nonlinearity,  and  lowpass  smoothing  filter  (see  Section  B.2 
for  a  detailed  description  of  the  various  F/D  subsystem  components).  The 
bandpass  filter  impulse  response  is  a  lowpass  window  function  h(t) 
modulated by a sinusoid of frequency $\Omega_c$. The window function has

Figure 2.1: Diagram of the Human Peripheral Auditory System


Appropriate  model  parameters  can  be  obtained  from  an  examination  of 
physiological  measurement  techniques  and  the  resulting  physiological 
data.  For  example,  the  response  of  a  nerve  fiber  to  acoustic  impulses,  or 
clicks,  is  often  described  by  a  poststimulus-time  (PST)  histogram.  A 
stimulus  is  repeated  a  large  number  of  times,  and  the  PST  histogram 
depicts  the  density  of  firings  as  a  function  of  time  following  the 
stimulus.  Thus,  a  PST  histogram  indicates  the  likelihood  that  a 
particular  nerve  will  fire  at  a  given  time  following  the  stimulus.  Firing 
patterns  of  individual  nerves  are  not  similar  in  appearance  to  a  PST 
histogram.  It  is  assumed,  however,  that  firings  from  a  large  population 
of  similar  nerves  could  be  combined  to  produce  a  deterministic  pattern 
approximating  a  PST  histogram. 

A  PST  histogram  is  shown  in  Fig.  2.3a  for  rarefaction  clicks,  and  in 
Fig.  2.3b  for  condensation  clicks.  The  experimental  animal  was  a  cat, 
the click level was -70 dB re 100 volt input to the condenser earphone,
and the nerve fiber was maximally responsive at a frequency of 1.67 kHz
(Kiang [12]). Fig. 2.4 presents eighteen further examples of PST
histograms for various characteristic frequencies from a single cat. The
level for Fig. 2.4 was -50 dB. Note the loss of timing details for
high characteristic frequencies.

Figure 2.3: PST Histogram Data and Corresponding Model Results

Figure 2.4: PST Histograms for Various Characteristic Frequencies

When the input to the F/D subsystem of Fig. 2.2 is an impulse, model
parameters can be chosen so that the output mimics a PST histogram over a
limited range of intensities. As a specific example, parameters are
chosen so that the PST histograms of Figs. 2.3a and 2.3b are simulated.
From Figs. 2.3a and 2.3b it can be seen that the delay is $\tau = 0.0024$ second
and the characteristic frequency is $\Omega_c = 2\pi \times 1670$ radians per second. Since
the  characteristic  frequency  is  low  enough  so  that  timing  details  are 
preserved,  the  smoothing  filter  has  no  effect  and  is  eliminated  by 
choosing $h_s(t) = \delta(t)$. Use of o=2500 sec^-1 and g=9x10^[...] sec^[...] results in a
reasonable match to the data. The F/D output, v(t), is shown in Fig.
2.3c for an input $x(t) = -\delta(t)$, and in Fig. 2.3d for an input $x(t) = \delta(t)$.

Model  parameters  can  be  chosen  to  mimic  many  features  of  auditory 
nerve  patterns  for  clicks  and  steady  sine  waves  over  limited  intensity 
ranges  (Siebert  [15]).  Agreement  over  wider  intensity  ranges,  and  for 
stimuli  such  as  tone  bursts  and  noise,  can  be  obtained  by  inserting  an 
Automatic  Gain  Control  (AGC)  at  the  bandpass  filter  output.  Recent 
research  (Smith  and  Zwislocki  [20];  Smith  [21];  Harris  and  Dallos  [22]) 
suggests  use  of  a  short-term  adaptation  function  rather  than  an  AGC.  In 
any  case,  such  improvements  will  not  be  considered  here. 

The  PST  histogram  envelope,  which  represents  a  short-term  average 
firing  rate,  is  often  a  function  of  interest  (Schroeder  [19]).  If  the 
lowpass smoothing filter bandwidth $\Omega_s$ is chosen such that $2\Omega_h < \Omega_s < \Omega_c - 2\Omega_h$,
then the F/D subsystem impulse response mimics the PST histogram envelope
rather than the detailed PST histogram (see Section B.3.6). Under these
conditions the F/D output is proportional to $h^2(t-\tau)$, and Fig. 2.3 shows
this function superimposed on the simulated PST histograms of Fig.


In general, it can be shown that the envelope of a half wave square
law device output is proportional to the envelope of a square law device
output (see Section B.2.2). Thus, if the smoothing filter bandwidth is
chosen  so  that  the  output  follows  the  PST  histogram  envelope,  then  the 
half  wave  square  law  device  can  be  replaced  by  a  square  law  device.  It 
will  be  shown  in  Section  2.4  that  the  resulting  F/D  subsystem  can  be 
implemented  by  a  generalized  form  of  the  Short-Time  Fourier  Transform 
(STFT)  magnitude  squared.  Therefore,  a  STFT  magnitude  squared  approach 
can  be  used  to  roughly  simulate  PST  histogram  envelope  functions  at  low 
frequencies ($\Omega_c < 2\pi \times 4000$ radians per second), and PST histograms at high
frequencies.  A  simplified  auditory  model  based  on  short-term  average 
firing  rates  is  thus  implemented  using  STFT  techniques. 

The  model  described  in  this  section  does  not  attempt  to  account  for 
all  known  aspects  and  limitations  of  the  auditory  system.  The  exact 
manner  in  which  signals  are  encoded  by  the  auditory  system  is  a  current 
research  topic,  and  several  theories  have  recently  been  developed  (see 
for example Sachs and Young [23], [24]; Delgutte and Kiang [25]).
Instead, the model demonstrates an approximate relationship between
certain  physiological  results  and  F/D  or  STFT  magnitude  analysis 
techniques,  and  shows  how  standard  analysis  techniques  must  be  modified 
for  auditory  modeling  purposes.  Although  the  auditory  model  presented  in 
this  section  is  crude,  it  will  be  shown  in  Chapter  3  that  no  important 
information  is  lost  by  such  an  approach  since  signals  can  be  synthesized 
from  the  simplified  auditory  model  outputs. 


2.3  A  SIMPLIFIED  AUDITORY  MODEL  INCORPORATING  PERCEPTUAL  RESULTS 


Although  auditory  model  parameters  can  be  derived  from  physiological 
data,  there  is  no  guarantee  that  the  resulting  model  will  simulate  human 
perception.  Recall  that  physiological  data  is  generally  obtained  from 
experimental  animals  rather  than  humans.  Furthermore,  since  available 
data  mainly  concerns  the  peripheral  auditory  system,  effects  of  higher 
processing  levels  are  not  included  in  models  based  on  such  data  alone. 
In  order  to  develop  a  speech  analysis  system,  it  is  desirable  to  account 
for  at  least  some  known  aspects  of  human  auditory  perception. 
Supplementary  information  is  therefore  required  for  the  determination  of 
appropriate  auditory  model  parameters. 

The  field  of  psychoacoustics  provides  an  alternative  means  of 
investigating the auditory system. Listening experiments are performed
on  live  human  subjects,  and  the  results  indicate  functional  behavior  of 
the  complete  auditory  system.  One  useful  psychoacoustic  result  is  the 
concept of a critical band. A critical band has been defined (Scharf
[26]) as the bandwidth at which subjective responses change abruptly.
For example, assume that a listener is subjected to a bandlimited noise
stimulus.  The  bandwidth  of  the  stimulus  is  varied  but  a  constant  sound 
pressure  level  is  maintained.  As  long  as  the  bandwidth  of  the  noise  is 
less  than  a  critical  band,  perceived  loudness  of  the  noise  remains 
constant.  When  the  bandwidth  of  the  noise  increases  beyond  a  critical 
band,  perceived  loudness  of  the  noise  begins  to  increase.  Since  similar 
critical  bands  are  encountered  in  a  variety  of  different  perceptual 
experiments,  critical  bands  are  often  used  to  describe  the  filtering 
process  assumed  to  take  place  within  the  auditory  system. 


A model of the human auditory system based on perception can be
constructed by combining physiological and psychoacoustic results. The
structure of the model is determined from physiology, as discussed in
Section  2.2.  Empirical  critical  bandwidth  data  from  humans,  rather  than 
physiological  tuning  curve  or  PST  histogram  data  from  animals,  is  then 
used  to  determine  the  bandpass  filter  center  frequencies  and  bandwidths. 

Table  2.1  presents  the  necessary  parameters  for  design  of  a  critical 
bandwidth filter bank (Scharf [26]). Note that the critical bandwidth
can  be  expressed  as  a  continuous  function  of  center  frequency  by 
interpolating  the  data  of  Table  2.1.  Fifteen  filters  are  chosen  to 
adequately  cover  the  200-3675  Hz  frequency  range.  The  filters  have 
nonuniform  center  frequency  spacing,  and  bandwidth  which  increases  with 
center  frequency.  The  filters  are  roughly  constant  bandwidth 
(approximately  110  Hz)  for  frequencies  below  700  Hz,  and  constant  Q 
(center  frequency  to  bandwidth  ratio  of  approximately  6.4)  above  700  Hz. 
Although  recent  estimates  of  auditory  filter  shape  suggest  use  of 
different values below 500 Hz (Moore and Glasberg [27]), the data of
Scharf  will  be  used  to  design  this  speech  analysis  system. 
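
The two regions described above (roughly constant bandwidth below 700 Hz, roughly constant Q above) can be summarized by a simple rule of thumb. The sketch below is only an approximation of that verbal description and is not a substitute for the interpolated data of Table 2.1; the function name and the hard switch at 700 Hz are assumptions made for illustration.

```python
def approx_critical_bandwidth_hz(center_hz):
    """Rough critical bandwidth in Hz for a center frequency in Hz.

    Assumption: a constant 110 Hz below 700 Hz and a constant Q of 6.4
    at and above 700 Hz, as quoted in the text. The real design
    interpolates the tabulated critical bandwidth data instead.
    """
    if center_hz < 700.0:
        return 110.0
    return center_hz / 6.4
```

At the 700 Hz boundary the two branches nearly agree (700/6.4 is about 109 Hz), which is why the piecewise description is a reasonable summary of the tabulated trend.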


TABLE  2.1 

CRITICAL BANDWIDTH FILTER BANK PARAMETERS


2.4 CRITICAL BANDWIDTH FILTER/DETECTOR BANK IMPLEMENTATION

This  section  describes  implementation  of  a  critical  bandwidth 
Filter/Detector  (F/D)  bank,  which  will  be  employed  as  part  of  the  speech 
analysis  system.  First,  a  new  relationship  between  F/D  subsystems  and 
the  continuous-time  Short-Time  Fourier  Transform  (STFT)  magnitude  squared 
is  described.  The  new  relationship  demonstrates  that  a  specific  type  of 
F/D  subsystem  can  be  implemented  via  the  STFT  magnitude  squared.  Next, 
the  discrete-time  case  is  described  and  then  generalized  to  allow 
implementation  of  a  critical  bandwidth  F/D  bank.  Finally,  the 
specifications  given  in  Sections  2.2  and  2.3  are  used  to  design  the 
desired  F/D  bank  via  STFT  techniques. 


2.4.1  CONTINUOUS-TIME  SHORT-TIME  FOURIER  TRANSFORM  DEFINITION 

The  STFT  is  a  widely  used  approach  to  time-dependent  frequency 
analysis.  For  the  continuous-time  case,  the  STFT  evaluated  at  some  fixed 
frequency $\Omega_c$ is defined as (Flanagan [1]):

$$X_t(j\Omega_c) = \int_{-\infty}^{\infty} x(\tau)\, h(t-\tau)\, e^{-j\Omega_c \tau}\, d\tau. \qquad (2.2)$$

Note that if h(t)=1 for all t, the STFT becomes the continuous-time
Fourier  transform  described  in  Appendix  A.  A  block  diagram  for  STFT 
computation,  which  expresses  the  STFT  in  terms  of  linear  filtering 
operations,  is  shown  in  Fig.  2.6a.  This  interpretation  indicates  that 
the STFT, viewed as a function of time at the fixed frequency $\Omega_c$, is a
lowpass  complex  function  bandlimited  to  the  window  function  bandwidth. 

An  equivalent  STFT  definition  is: 

$$X_t(j\Omega_c) = e^{-j\Omega_c t} \int_{-\infty}^{\infty} x(t-\tau)\, h(\tau)\, e^{j\Omega_c \tau}\, d\tau. \qquad (2.3)$$

The  corresponding  block  diagram  is  shown  in  Fig.  2.6b.  In  this  approach, 
a  complex  modulation  signal  is  used  to  downconvert  the  bandpass  filter 
output  into  a  lowpass  function. 
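
The agreement between the two linear filtering interpretations can be checked numerically. The following is a minimal discrete-time sketch, assuming finite-length sequences and a sum in place of each integral; the function names, the use of NumPy, and the sampled window are illustrative choices, not part of the original report.

```python
import numpy as np

def stft_modulate_then_lowpass(x, h, omega):
    """Modulator followed by lowpass filter (discrete analogue of Eq. 2.2).

    Computes sum_m x[m] * h[n - m] * exp(-j * omega * m).
    """
    n = np.arange(len(x))
    return np.convolve(x * np.exp(-1j * omega * n), h)

def stft_bandpass_then_modulate(x, h, omega):
    """Bandpass filter followed by modulator (discrete analogue of Eq. 2.3).

    Computes exp(-j * omega * n) * sum_m x[n - m] * h[m] * exp(j * omega * m).
    """
    m = np.arange(len(h))
    bandpass_out = np.convolve(x, h * np.exp(1j * omega * m))
    n = np.arange(len(bandpass_out))
    return np.exp(-1j * omega * n) * bandpass_out

# Usage: both forms agree to rounding error for an arbitrary input.
rng = np.random.default_rng(0)
x_demo = rng.standard_normal(200)
h_demo = np.hanning(32)
same = np.allclose(stft_modulate_then_lowpass(x_demo, h_demo, 0.6),
                   stft_bandpass_then_modulate(x_demo, h_demo, 0.6))
```

The second form makes the downconversion explicit: the complex bandpass output is multiplied by a complex exponential to shift it back to baseband.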

It follows from Equation 2.3 that the imaginary part of $e^{j\Omega_c t} X_t(j\Omega_c)$
is the output of a bandpass filter which has input x(t) and impulse
response $h(t)\sin(\Omega_c t)$. Thus, the STFT could be used to implement the
bandpass  filter  portion  of  the  F/D  subsystem  shown  in  Fig.  2.2.  This 
approach,  however,  will  not  be  pursued. 

The  methods  for  STFT  computation  shown  in  Fig.  2.6  employ  a  local 
oscillator (Taub and Schilling [28]). Thus, the STFT is different from a
detection  process  which  uses  a  F/D.  It  will  be  shown  in  Section  2.4.2 
that  the  STFT  magnitude  squared,  rather  than  the  STFT,  corresponds  to  a 
detection process using a F/D.

(a) Modulator followed by Lowpass Filter

(b) Bandpass Filter followed by Modulator

Figure 2.6: Linear Filtering Interpretation of the STFT


2.4.2  CONTINUOUS-TIME  F/D  IMPLEMENTATION  USING  STFT  MAGNITUDE  SQUARED 


F/D  subsystems  have  long  been  used  as  a  means  of  approximating  the 
STFT magnitude squared. Early work by Fano [29] described a relationship
between F/D subsystems and the STFT magnitude squared for special window
functions. Schroeder and Atal [30] extended this work to include
arbitrary window functions, and the results are discussed by Flanagan [1]
and Gambardella [7]. However, these authors did not characterize basic
F/D parameters such as lowpass smoothing filter bandwidth. Flanagan [1]
discusses  a  relationship,  valid  only  for  certain  signals  under 
restrictive  conditions,  which  links  the  STFT  magnitude  with  speech 
spectrograms  (see  Section  D.3).  Flanagan  also  discusses  a  relationship 
between  long-term  average  F/D  outputs  and  an  averaged  version  of  the  STFT 
magnitude  squared. 

In this section, a new relationship between F/D subsystems and the
STFT magnitude squared is described. The new relationship is more
precise than those previously reported in the literature, and
demonstrates the equivalence between the STFT magnitude squared and a
specific type of F/D subsystem.


2.4.2.1 PLAUSIBILITY ARGUMENT


A  system  for  computing  the  STFT  magnitude  squared  is  shown  in  Fig. 
2.7a.  From  this  system  and  the  modulation  property  of  Fourier  transforms 
(described in Appendix A), it is easily seen that $|X_t(j\Omega_c)|^2$ is a lowpass
real  function  of  time  which  is  bandlimited  to  twice  the  window  function 
bandwidth. 

An  equivalent  system  for  computing  the  STFT  magnitude  squared  is 
shown  in  Fig.  2.7b.  In  this  figure,  the  output  of  each  square  law  device 
consists  of  a  lowpass  function  and  a  high  frequency  bandpass  function 
(see  Section  B.3.1).  The  high  frequency  bandpass  functions  cancel  out  in 
the adder, while the lowpass functions combine to form $|X_t(j\Omega_c)|^2$.
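
The cancellation described above can be illustrated with a short discrete-time sketch. It assumes that the two bandpass filters have impulse responses equal to the window multiplied by a cosine and by a sine at the analysis frequency, each followed by a square law device, with the two outputs added; that structure is inferred from the text, and the function names and NumPy usage are illustrative only.

```python
import numpy as np

def stft_mag_squared_two_filters(x, h, omega):
    """Two bandpass filters, two squarers, and an adder (assumed structure)."""
    m = np.arange(len(h))
    in_phase = np.convolve(x, h * np.cos(omega * m))    # cosine bandpass output
    quadrature = np.convolve(x, h * np.sin(omega * m))  # sine bandpass output
    # The double-frequency terms of the two squares have opposite signs,
    # so they cancel in the sum and only the lowpass part remains.
    return in_phase ** 2 + quadrature ** 2

def stft_mag_squared_direct(x, h, omega):
    """Reference: magnitude squared of the modulate-then-lowpass STFT."""
    n = np.arange(len(x))
    return np.abs(np.convolve(x * np.exp(-1j * omega * n), h)) ** 2
```

For real x and h the two functions agree to rounding error for any window, since the pair of bandpass outputs is the real and imaginary part of the demodulated STFT.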

The fact that only lowpass functions are retained by the STFT
magnitude squared suggests that a similar result could be produced by the
F/D subsystem of Fig. 2.7c. Details of this F/D will be described in
Section 2.4.2.2. In Fig. 2.7c, the high frequency components at the
square  law  device  output  are  eliminated  by  linear  filtering  rather  than 
cancellation.  Thus,  the  STFT  magnitude  squared  and  F/D  outputs  will 
generally  be  similar  but  not  identical. 


(b) Exact Computation using Bandpass Filters

(c) Approximate Computation using a Bandpass Filter and Detector

Figure 2.7: Short-Time Fourier Transform Magnitude Squared Computation


2.4.2.2 PROOF OF F/D AND STFT MAGNITUDE SQUARED EQUIVALENCE


The F/D subsystem shown in Fig. 2.7c consists of a Linear
Time-Invariant (LTI) bandpass filter, square law device with
multiplicative constant, and LTI smoothing filter. The impulse response
of the bandpass filter contains an arbitrary constant parameter $\theta$. If
$\theta = -\pi/2$, for example, the bandpass filter impulse response is $h(t)\sin(\Omega_c t)$.
Nomenclature  for  the  signals  in  Fig.  2.7c  follows  that  of  the  general  F/D 
theory  presented  in  Appendix  B. 

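Let the window function h(t) be the impulse response of an ideal
lowpass filter with bandwidth $\Omega_h$:

$$h(t) = \frac{\sin(\Omega_h t)}{\pi t}. \qquad (2.4)$$

The output of the bandpass filter in Fig. 2.7c is:

$$\begin{aligned}
y(t) &= x(t) * [h(t)\cos(\Omega_c t + \theta)] \\
     &= \int_{-\infty}^{\infty} x(\tau)\, h(t-\tau)\, \cos(\Omega_c t - \Omega_c \tau + \theta)\, d\tau \\
     &= f(t)\cos(\Omega_c t) + g(t)\sin(\Omega_c t), \qquad (2.5)
\end{aligned}$$

where

$$f(t) = [x(t)\cos(\Omega_c t - \theta)] * h(t) \qquad (2.6)$$

and

$$g(t) = [x(t)\sin(\Omega_c t - \theta)] * h(t) \qquad (2.7)$$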


are lowpass functions. The square law device output is:

$$2w_0(t) = [f^2(t) + g^2(t)] + [\cos(2\Omega_c t)][f^2(t) - g^2(t)] + 2[\sin(2\Omega_c t)]\, f(t)\, g(t). \qquad (2.8)$$

Since f(t) and g(t) are lowpass bandlimited to the frequency domain
region $|\Omega| < \Omega_h$, the function $f^2(t)+g^2(t)$ is lowpass bandlimited to
$|\Omega| < 2\Omega_h$. The remaining components of Equation 2.8 are high frequency
bandpass signals limited to the region $2\Omega_c - 2\Omega_h < |\Omega| < 2\Omega_c + 2\Omega_h$. Let the
smoothing filter with impulse response $h_{s1}(t)$ be an ideal lowpass filter
having bandwidth $\Omega_{s1}$:

$$h_{s1}(t) = \frac{\sin(\Omega_{s1} t)}{\pi t}. \qquad (2.9)$$

Also, let $2\Omega_h < \Omega_{s1} < 2\Omega_c - 2\Omega_h$. It follows that $2\Omega_h < \Omega_c$. The F/D output is

$$2v_0(t) = f^2(t) + g^2(t), \qquad (2.10)$$

which is positive even though the impulse response of the smoothing
filter is not positive for all t. Since the signals x(t) and h(t) are
real, it follows from Equations 2.6, 2.7, and 2.10 that

$$\begin{aligned}
2v_0(t) &= \{[x(t)\cos(\theta - \Omega_c t)] * h(t)\}^2 + \{[x(t)\sin(\theta - \Omega_c t)] * h(t)\}^2 \\
        &= \left| [x(t)\, e^{j(\theta - \Omega_c t)}] * h(t) \right|^2 \\
        &= |X_t(j\Omega_c)|^2, \qquad (2.11)
\end{aligned}$$

where $X_t(j\Omega_c)$ is the STFT evaluated at a fixed frequency $\Omega_c$. Thus, when
ideal lowpass filter functions are used for h(t) and $h_{s1}(t)$, the STFT
magnitude  squared  is  exactly  the  same  as  the  F/D  subsystem  output  of  Fig. 
2.7c.  The  F/D  subsystem  of  Fig.  2.7c  can  therefore  be  used  to  measure  the 
STFT  magnitude  squared,  or  the  STFT  magnitude  squared  can  be  used  to 
implement  this  F/D  subsystem.  STFT  magnitude  squared  (and  therefore  STFT 
magnitude)  analysis  of  noise,  impulse,  sinusoid,  and  sinusoidal  pair 
signals  follows  directly  from  the  examples  of  Section  B.3.  Note  that  the 
parameter $\theta$ does not appear in the final result and has no effect on the
F/D  output. 
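
The equivalence can be reproduced numerically under a simplifying assumption: instead of continuous-time ideal filters, the sketch below uses circular (DFT-based) brick-wall filters on a finite periodic sequence, so the band edges are expressed as integer DFT bins. The function names, the bin values, and the factor of 2 applied in the square law device are assumptions chosen for illustration (the factor is picked so that the output equals the STFT magnitude squared); this is a sanity check, not the report's implementation.

```python
import numpy as np

def ideal_lowpass(signal, cutoff_bin):
    """Circular brick-wall lowpass: keep DFT bins with |k| <= cutoff_bin."""
    n_samples = len(signal)
    bins = np.fft.fftfreq(n_samples, d=1.0 / n_samples)  # integer bin indices
    spectrum = np.fft.fft(signal)
    spectrum[np.abs(bins) > cutoff_bin] = 0.0
    return np.fft.ifft(spectrum)

def stft_mag_squared(x, carrier_bin, window_bin):
    """Magnitude squared of the (circular) STFT at one analysis frequency."""
    n = np.arange(len(x))
    omega = 2.0 * np.pi * carrier_bin / len(x)
    return np.abs(ideal_lowpass(x * np.exp(-1j * omega * n), window_bin)) ** 2

def filter_detector(x, carrier_bin, window_bin, smooth_bin, theta):
    """Bandpass filter, square law device (constant 2), smoothing filter."""
    n_samples = len(x)
    n = np.arange(n_samples)
    omega = 2.0 * np.pi * carrier_bin / n_samples
    impulse = np.zeros(n_samples)
    impulse[0] = 1.0
    h = ideal_lowpass(impulse, window_bin).real       # ideal lowpass window
    bandpass = h * np.cos(omega * n + theta)          # h * cos(omega*n + theta)
    y = np.fft.ifft(np.fft.fft(x) * np.fft.fft(bandpass)).real
    w = 2.0 * y ** 2                                  # square law device
    return ideal_lowpass(w, smooth_bin).real          # smoothing filter

# Bins chosen so that 2*window < smooth < 2*carrier - 2*window,
# and 2*carrier + 2*window stays below half the sequence length.
rng = np.random.default_rng(2)
x_demo = rng.standard_normal(256)
fd_out = filter_detector(x_demo, carrier_bin=40, window_bin=5,
                         smooth_bin=30, theta=0.3)
stft_out = stft_mag_squared(x_demo, carrier_bin=40, window_bin=5)
```

With these bin choices `fd_out` and `stft_out` agree to rounding error, and changing `theta` leaves `fd_out` unchanged, mirroring the remark that the phase parameter has no effect on the F/D output.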

When  the  window  function  h(t)  is  the  impulse  response  of  a 
realizable non-ideal lowpass filter, the F/D subsystem of Fig. 2.7c is
not  necessarily  equivalent  to  the  STFT  magnitude  squared  (although 
agreement  is  generally  quite  good).  For  non-ideal  lowpass  filter  window 
functions,  the  lowpass  and  high  frequency  bandpass  components  of  Equation 
2.8  overlap  in  the  frequency  domain  and  cannot  be  separated  by  any  LTI 
smoothing  filter.  Thus,  although  Equation  2.8  correctly  describes  the 
smoothing filter input, Equation 2.10 becomes an approximate description
of  the  smoothing  filter  output.  Under  these  conditions  the  F/D  output  is 
approximately,  but  not  exactly,  the  same  as  the  STFT  magnitude  squared. 

It  should  be  noted  that  many  other  window  function  and  smoothing 
filter  combinations  exist  which  yield  a  F/D  output  identical  to  the  STFT 
magnitude squared. As a simple example, let the window function be an
impulse, $h(t) = \delta(t)$. If the smoothing filter has impulse response
$h_{s1}(t) = \delta(t)/[2(\cos\theta)^2]$, $\cos\theta \neq 0$, then both the F/D subsystem and STFT
magnitude squared produce the result $x^2(t)$.


2.4.2.3 DISCUSSION


In Section 2.4.2.2 it was shown that the STFT magnitude squared can
be used to implement a F/D subsystem of the type shown in Fig. 2.7c. The
fixed STFT analysis frequency, $\Omega_c$, determines the center frequency of the
bandpass filter in the F/D subsystem. Let $\Omega_h$ denote the one-sided main
lobe bandwidth (see Section B.2.1) of any lowpass window function h(t).
As long as the bandpass filter has a center frequency which is greater
than its bandwidth, i.e. $\Omega_c > 2\Omega_h$, a lowpass smoothing filter operation is
effectively implemented by the STFT magnitude squared computation. The
effective smoothing filter can be considered to have the same bandwidth
as the bandpass filter, i.e. $2\Omega_h$. The window function thus determines the
bandwidth  of  both  the  bandpass  filter  and  the  lowpass  smoothing  filter. 

There  are  many  advantages  to  implementing  a  F/D  via  the  STFT 
magnitude  squared.  The  STFT  is  widely  used,  so  literature  and  computer 
programs  are  readily  available.  Since  the  magnitude  squared  computation 
automatically  implements  an  effective  smoothing  filter,  results  may  be 
obtained  more  efficiently  than  if  a  direct  F/D  implementation  is  used. 
Since there are no delay elements between the bandpass filter outputs and
the adder output of Fig. 2.7b, the effective smoothing filter implemented
by the STFT magnitude squared has zero delay regardless of the window
function used. When the STFT magnitude squared is used to implement a F/D
subsystem, difficulties normally associated with smoothing filter design
(as discussed in Section B.2.3) are eliminated and the output is
guaranteed  to  be  positive  at  all  times.  This  feature  is  desirable  for 
auditory  modeling  purposes  since  nerve  firing  rates  are  always  positive. 


F/D  implementation  via  the  STFT  magnitude  squared  has  disadvantages 
as  well.  The  STFT  magnitude  squared  does  not  generally  produce  results 
identical  to  those  produced  by  direct  F/D  subsystem  implementations. 
Design  flexibility  is  limited  since  the  F/D  bandpass  filter  must  be  of  a 
specific  type,  the  memoryless  nonlinearity  must  be  a  square  law  device, 
and  the  lowpass  smoothing  filter  must  have  the  same  bandwidth  as  the 
bandpass  filter.  Despite  these  limitations,  however,  F/D  subsystems 
implemented  via  the  STFT  magnitude  squared  are  appropriate  for  many 
applications.


2.4.3  DISCRETE-TIME  F/D  IMPLEMENTATION  USING  STFT  MAGNITUDE  SQUARED 


For  convenience,  a  discrete-time  implementation  of  the  speech 
analysis  system  is  desired.  The  "analog"  continuous-time  theory 
presented  in  Section  2.4.2  must  therefore  be  extended  to  the  "digital" 
discrete-time case. One procedure for transforming an analog filter
design to a digital filter design is known as the impulse invariant
method (Oppenheim and Schafer [31]). In this procedure, the unit-sample
response of the digital filter consists of equally spaced samples of the
impulse response of the analog filter. For example, if h(t) is the impulse
response of an analog lowpass filter, then the unit-sample response of
the corresponding digital filter is:

$$h(n) = \{h(t)\}\big|_{t=nT}, \tag{2.12}$$

where T is the sampling period. The continuous-time F/D subsystem of
Fig. 2.7c can be transformed, via the impulse invariant method, into the
discrete-time F/D subsystem of Fig. 2.8. The bandpass filter center
frequency is $\omega_c$ and $\theta$ is an arbitrary constant parameter.

Let the window function h(n) be the unit-sample response of an ideal
lowpass filter with bandwidth $\omega_h$:

$$h(n) = \frac{\sin(\omega_h n)}{\pi n}. \tag{2.13}$$


The Fourier transform of the window function, FT{h(n)}, is shown in Fig.
2.9a. Spectral regions occupied by the bandpass filter output are shown
in Fig. 2.9b. The bandpass filter output is:

$$\begin{aligned}
y(n) &= x(n) * [h(n)\cos(\omega_c n + \theta)] \\
     &= \sum_{m=-\infty}^{\infty} x(m)h(n-m)\cos(\omega_c n - \omega_c m + \theta) \\
     &= f(n)\cos(\omega_c n) + g(n)\sin(\omega_c n),
\end{aligned} \tag{2.14}$$

where

$$f(n) = [x(n)\cos(\omega_c n - \theta)] * h(n) \tag{2.15}$$

and

$$g(n) = [x(n)\sin(\omega_c n - \theta)] * h(n). \tag{2.16}$$

On the interval $-\pi<\omega\le\pi$, f(n) and g(n) are lowpass bandlimited to
$|\omega|<\omega_h$. The square law device output is:

$$2w(n) = [f^2(n) + g^2(n)] + [\cos(2\omega_c n)][f^2(n) - g^2(n)] + 2[\sin(2\omega_c n)]f(n)g(n). \tag{2.17}$$

By the modulation property, the function $f^2(n)+g^2(n)$ is lowpass
bandlimited (on the interval $-\pi<\omega\le\pi$) to the region $|\omega|<2\omega_h$, as
shown in Fig. 2.9c. The remaining components of Equation 2.17 are high
frequency bandpass signals which may be eliminated by the smoothing
filter. Let the smoothing filter with unit-sample response $h_{sl}(n)$ be an
ideal lowpass filter having bandwidth $\omega_{sl}$:

$$h_{sl}(n) = \frac{\sin(\omega_{sl} n)}{\pi n}. \tag{2.18}$$

Also, let $2\omega_h<\omega_{sl}<2\omega_c-2\omega_h$ and $2\omega_h<\omega_{sl}<2\pi-2\omega_c-2\omega_h$. It follows
that $2\omega_h<\omega_c<\pi-2\omega_h$. The smoothing filter output is:

$$2v_0(n) = f^2(n) + g^2(n). \tag{2.19}$$

Since the signals x(n) and h(n) are real, it follows that

$$2v_0(n) = |X_n(e^{j\omega_c})|^2, \tag{2.20}$$

where $X_n(e^{j\omega_c})$ is the discrete-time STFT evaluated at a fixed frequency
$\omega_c$, and is defined as (Rabiner and Schafer [3]):

$$X_n(e^{j\omega_c}) = \sum_{m=-\infty}^{\infty} x(n-m)h(m)e^{-j\omega_c(n-m)}. \tag{2.21}$$


The discrete-time STFT thus follows directly from application of the
impulse invariant transformation to the continuous-time STFT. Note that
the discrete-time STFT of Equation 2.21 corresponds to the
continuous-time STFT of Equation 2.3, and the discrete-time F/D result of
Equation 2.20 corresponds to the continuous-time result of Equation 2.11.
The discussion of Section 2.4.2.3 therefore applies to both the
discrete-time and continuous-time cases.
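The chain of identities in Equations 2.14, 2.17, 2.19 and 2.20 can be checked numerically. The following is a minimal sketch, assuming a finite-length real input and an FIR window so that the sums are finite; the function names (`conv_at`, `stft`, `max_deviation`) are illustrative and do not come from the text. It verifies the algebraic identities only; the ideal smoothing filter of Equation 2.18 is not simulated.

```python
import cmath
import math

def conv_at(u, h, n):
    """(u * h)(n) for a finite sequence u (zero outside 0..len(u)-1) and FIR h."""
    return sum(u[n - m] * hm for m, hm in enumerate(h) if 0 <= n - m < len(u))

def stft(x, h, wc, n):
    """Eq. 2.21 at a fixed frequency wc for an FIR window h."""
    return sum(x[n - m] * hm * cmath.exp(-1j * wc * (n - m))
               for m, hm in enumerate(h) if 0 <= n - m < len(x))

def max_deviation(x, h, wc, theta):
    """Largest absolute mismatch found in Eqs. 2.14, 2.17 and 2.19-2.20."""
    idx = range(len(x))
    bandpass = [hm * math.cos(wc * m + theta) for m, hm in enumerate(h)]
    xc = [x[n] * math.cos(wc * n - theta) for n in idx]
    xs = [x[n] * math.sin(wc * n - theta) for n in idx]
    worst = 0.0
    for n in idx:
        y = conv_at(x, bandpass, n)          # bandpass filter output
        f = conv_at(xc, h, n)                # Eq. 2.15
        g = conv_at(xs, h, n)                # Eq. 2.16
        c, s = math.cos(wc * n), math.sin(wc * n)
        c2, s2 = math.cos(2 * wc * n), math.sin(2 * wc * n)
        e14 = y - (f * c + g * s)
        e17 = 2 * y * y - ((f * f + g * g) + c2 * (f * f - g * g) + 2 * s2 * f * g)
        e20 = (f * f + g * g) - abs(stft(x, h, wc, n)) ** 2
        worst = max(worst, abs(e14), abs(e17), abs(e20))
    return worst
```

Because $f^2(n)+g^2(n)$ does not depend on $\theta$, the check should pass for any value of the arbitrary phase parameter.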


A difference between the discrete-time and continuous-time
implementation occurs in the restriction on bandpass filter center
frequency relative to bandwidth. For the continuous-time case, it is
required that the bandpass filter have a center frequency which is
greater than its bandwidth, ie., $2\Omega_h<\Omega_c$. This restriction also applies
to the discrete-time case, ie., $2\omega_h<\omega_c$. However, an additional
restriction must be applied in the discrete-time case because of the
periodic spectral characteristics shown in Fig. 2.9c. An upper limit must
be applied to the digital bandpass filter center frequency, resulting in
the restriction $2\omega_h<\omega_c<\pi-2\omega_h$. In other words, the digital bandpass
filter must have a center frequency which is greater than its bandwidth
but less than pi minus the bandwidth. The upper frequency limit is
discussed further in Section 2.4.5.

It can easily be shown that if $\omega_c$ does not fall within the range
$2\omega_h<\omega_c<\pi-2\omega_h$, then the STFT magnitude squared does not implement a F/D
subsystem. For example, if $\omega_c=0$ it follows from Equation 2.21 that the
STFT magnitude squared is equivalent to a lowpass filter with impulse
response h(n) followed by a square law device. Similarly, if $\omega_c=\pi$ the
STFT magnitude squared is equivalent to a highpass filter with impulse
response $(-1)^n h(n)$ followed by a square law device. In all cases,
however, the STFT magnitude squared is a lowpass function with bandwidth
$2\omega_h$.

2.4.4 THE GENERALIZED SHORT-TIME FOURIER TRANSFORM (DISCRETE-TIME CASE)

It was shown in Section 2.4.2 that the STFT magnitude squared can be
used  to  implement  a  F/D  subsystem  in  which  the  bandpass  filter  bandwidth 
is  fixed  by  choice  of  the  window  function.  The  simplified  auditory 
system  model  described  in  Sections  2.2  and  2.3,  however,  uses  a  bank  of 
F/D  subsystems  in  which  each  bandpass  filter  has  a  different  bandwidth. 
Therefore,  a  generalized  version  of  the  STFT  which  allows  a  different 
window  function  at  each  analysis  frequency  must  be  used  to  implement  the 
auditory  model.  Only  the  discrete-time  case  will  be  discussed. 

Let the STFT be evaluated at K discrete arbitrarily spaced
frequencies $\omega_k$, where k=1,2,...,K. A different window function $h_k(n)$ may
be used at each frequency. It follows from Equation 2.21 that the
Generalized Short-Time Fourier Transform (GSTFT) can be defined as
(Rabiner and Schafer [3]):

$$X_n(e^{j\omega_k}) = \sum_{m=-\infty}^{\infty} x(n-m)h_k(m)e^{-j\omega_k(n-m)}. \tag{2.22}$$

It is assumed that the signal x(n) and the set of window functions $h_k(n)$
are real. Several approaches to GSTFT computation are discussed in
Appendix C.


The  GSTFT  magnitude  squared  can  be  used  to  implement  a  bank  of  F/D 
subsystems  similar  to  the  type  shown  in  Fig.  2.8.  The  resulting  bandpass 
filters have impulse response $h_k(n)\cos(\omega_k n+\theta_k)$, where $\theta_k$ is arbitrary.
The bandpass filter with center frequency $\omega_k$ has a bandwidth determined
by $h_k(n)$. As long as the bandpass filter has a center frequency which is
greater  than  its  bandwidth  (and  less  than  pi  minus  the  bandwidth),  a 
lowpass  smoothing  filter  operation  is  effectively  implemented  by  the 
GSTFT  magnitude  squared  computation.  The  bandpass  and  smoothing  filters 
can  be  considered  to  have  the  same  bandwidth. 


2.4.5  PERCEPTION-BASED  SPEECH  ANALYSIS  SYSTEM  IMPLEMENTATION 


Since  GSTFT  magnitude  squared  results  are  proportional  to  the 
desired  F/D  bank  outputs,  the  GSTFT  magnitude  squared  can  be  used  to 
implement  the  F/D  bank  specified  in  Sections  2.2  and  2.3.  First,  a  bank 
of  fifteen  continuous-time  bandpass  filters  must  be  designed  using  the 
critical bandwidth data of Table 2.1. Let each filter have impulse
response $h_k(t)\sin(\Omega_k t)$, where $\Omega_k$ is the center frequency, and $h_k(t)$ is
the set of window functions defined by:

$$h_k(t) = \begin{cases} \alpha_k^3 t^2 e^{-\alpha_k t}, & 0 \le t \\ 0, & \text{otherwise}, \end{cases} \tag{2.23}$$

for k=1,2,...,15. The delay $\tau$, as shown in Fig. 2.2, will be neglected.
The Laplace transform (see Appendix A) of each window function is:

$$\mathrm{LT}\{h_k(t)\} = \frac{2\alpha_k^3}{(s+\alpha_k)^3}.$$

Frequency domain characteristics of the windows are shown in Fig. 2.10.
Each window has a 3dB bandwidth of $.509\alpha_k$ rad/sec, so each bandpass
filter has a 3dB bandwidth of $1.018\alpha_k$ rad/sec. Specific values for $\Omega_k$
and $\alpha_k$ can be obtained from Table 2.1. For example, $\Omega_1=2\pi\times 250$ rad/sec,
and $\alpha_1=(2\pi\times 100)/1.018$ sec$^{-1}$.
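The quoted 3dB bandwidth can be checked with a short calculation. It assumes only that each window is proportional to $t^2 e^{-\alpha_k t}$, so that its transform has a triple pole at $s=-\alpha_k$; the constant scale factor does not affect the result.

$$|H_k(j\Omega)|^2 \propto \frac{1}{(\Omega^2+\alpha_k^2)^3}, \qquad \frac{\alpha_k^6}{(\Omega_{3\mathrm{dB}}^2+\alpha_k^2)^3} = \frac{1}{2} \;\Rightarrow\; \Omega_{3\mathrm{dB}} = \alpha_k\sqrt{2^{1/3}-1} \approx 0.5098\,\alpha_k .$$

Doubling this value gives the bandpass filter bandwidth, which the text rounds to $1.018\alpha_k$. With $\alpha_1=(2\pi\times 100)/1.018$ the first bandpass filter therefore has a 3dB bandwidth of $2\pi\times 100$ rad/sec, ie., 100 Hz.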


The  impulse  invariant  method  can  now  be  applied  to  obtain  a  digital 
implementation.  Let  T  represent  the  digital  system  sampling  period  in 
seconds. The window functions of Equation 2.23 are transformed as:

$$h_k(n) = \begin{cases} (\alpha_k T)^3 n^2 e^{-\alpha_k nT}, & 0 \le n \\ 0, & \text{otherwise}, \end{cases} \tag{2.26}$$


where an additional factor of T has been included to compensate for the
analog to digital transformation (Oppenheim and Schafer [31]). The
window functions of Equation 2.26, which have rational z-transforms (see
Appendix A and Section C.2), can be written in the form

$$h_k(n) = \sum_{\psi=1}^{\Psi_k} p_k(\psi)h_k(n-\psi) + \sum_{r=I}^{R_k} q_k(r)\delta(n-r) \tag{2.27}$$

by choosing

$$I=1, \quad \Psi_k=3, \quad R_k=2,$$

$$q_k(1)=(\alpha_k T)^3 e^{-\alpha_k T}, \quad q_k(2)=(\alpha_k T)^3 e^{-2\alpha_k T},$$

$$p_k(1)=3e^{-\alpha_k T}, \quad p_k(2)=-3e^{-2\alpha_k T}, \quad \text{and} \quad p_k(3)=e^{-3\alpha_k T}. \tag{2.28}$$

Substitution of the Infinite-duration Impulse Response (IIR) window
function defined by Equation 2.27 into Equation 2.22 yields a recursive
formula for the GSTFT (Rabiner and Schafer [3]):

$$X_n(e^{j\omega_k}) = \sum_{\psi=1}^{\Psi_k} p_k(\psi)X_{n-\psi}(e^{j\omega_k}) + \sum_{r=I}^{R_k} q_k(r)x(n-r)e^{-j\omega_k(n-r)}. \tag{2.29}$$

Note that the recursive GSTFT is computationally efficient for small
values of $\Psi_k$ and $R_k$. An implementation suitable for real-time
applications is presented in Section C.2. Values for $\alpha_k$ and $\omega_k$,
k=1,2,...,15, are obtained from Table 2.1 via the formulas
$\alpha_k$=(6.17)(Critical Bandwidth in Hz) and $\omega_k$=(2πT)(Center Frequency in Hz).
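The recursion can be sketched for a single channel as follows. This is a minimal illustration and not the real-time implementation of Section C.2: it assumes a right-sided input (x(n)=0 for n<0, hence zero initial conditions), the window $(\alpha_k T)^3 n^2 e^{-\alpha_k nT}$ with the three feedback and two feedforward coefficients of Equations 2.28, and hypothetical function names. A direct evaluation of the defining sum is included for comparison.

```python
import cmath
import math

def window_coeffs(alpha, T):
    """Feedback p(1..3) and feedforward q(1..2) coefficients of Eqs. 2.28."""
    a = math.exp(-alpha * T)
    s = (alpha * T) ** 3
    return [3 * a, -3 * a ** 2, a ** 3], [s * a, s * a ** 2]

def gstft_recursive(x, alpha, w, T):
    """One GSTFT channel via the recursion, for x(n) = 0 when n < 0."""
    p, q = window_coeffs(alpha, T)
    X = []
    for n in range(len(x)):
        acc = 0j
        for psi, pv in enumerate(p, start=1):
            if n - psi >= 0:
                acc += pv * X[n - psi]
        for r, qv in enumerate(q, start=1):
            if n - r >= 0:
                acc += qv * x[n - r] * cmath.exp(-1j * w * (n - r))
        X.append(acc)
    return X

def gstft_direct(x, alpha, w, T, n):
    """The defining sum, evaluated directly with the same window."""
    total = 0j
    for m in range(n + 1):
        hm = (alpha * T) ** 3 * m * m * math.exp(-alpha * m * T)
        total += x[n - m] * hm * cmath.exp(-1j * w * (n - m))
    return total
```

For the first channel described in the text (center frequency 250 Hz, critical bandwidth 100 Hz, T=.0001 sec), the parameters would be `alpha = 6.17 * 100` and `w = 2 * math.pi * T * 250`. The recursion needs a fixed number of operations per output sample, whereas the direct sum grows with n.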

Each  bandpass  filter  of  the  digital  F/D  bank  implemented  via  the 
GSTFT  magnitude  squared  must  meet  two  requirements.  First,  each  filter 
must  have  a  center  frequency  which  is  greater  than  its  bandwidth. 
Second,  each  filter  must  have  a  center  frequency  less  than  pi  minus  the 
bandwidth.  Since  the  analog  filters  of  Table  2.1  meet  the  first 
requirement,  so  do  the  corresponding  digital  filters.  The  second 
requirement depends upon the sampling period T. In terms of analog
filter parameters, the sum of center frequency and critical bandwidth
(both in Hz) must be less than 1/2T for each filter. The value T=.0001
second (ie., a 10 KHz sampling rate) is used to ensure that the second
requirement is met.
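Both requirements reduce to simple inequalities in Hz, which can be expressed as a small check. The function name is hypothetical; the first example uses the 250 Hz center frequency and 100 Hz bandwidth quoted for the first channel, while the second set of values is invented purely to show a failing case and is not taken from Table 2.1.

```python
def meets_requirements(center_hz, bandwidth_hz, T):
    """True if bandwidth < center and center + bandwidth < 1/(2T), all in Hz."""
    greater_than_bandwidth = center_hz > bandwidth_hz
    below_upper_limit = center_hz + bandwidth_hz < 1.0 / (2.0 * T)
    return greater_than_bandwidth and below_upper_limit

# First channel from the text at a 10 KHz sampling rate.
first_channel_ok = meets_requirements(250.0, 100.0, 1e-4)
# Hypothetical channel whose upper edge would exceed 5 KHz.
hypothetical_ok = meets_requirements(4600.0, 500.0, 1e-4)
```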


Fig. 2.11 shows the F/D bank response to an impulse input applied at
t=.0032 sec. The figure has linear amplitude and time scales. The graph
of each F/D subsystem output, or "channel," has been normalized to the
same peak value. Apart from a scale factor, the graphs of Fig. 2.11 are
comparable in shape and duration to PST histogram envelopes (refer to
Fig. 2.4). The impulse response of each F/D subsystem is proportional to
$[h_k(t)]^2$.


When the F/D bank input is a sine wave, the output of each F/D
subsystem is a constant. The graphs of Fig. 2.12 were obtained from
average steady-state sinusoidal response measurements for each F/D
subsystem, and these graphs correspond well with the critical bandwidth
filter bank parameters given in Table 2.1. Since the F/D bank was not
designed to match physiological tuning curves, the graphs of Fig. 2.12 do
not possess the steep skirts exhibited by tuning curves. However, it is
possible to obtain a better match to tuning curve data by using
different window functions.




2.5  SHORT-TIME  ENERGY 

Short-Time Energy (STE) is a quantity which will prove useful in
signal synthesis, as described in Chapter 3. For the discrete-time case,
STE is defined by (Rabiner and Schafer [3]):

$$E_n = \sum_{m=-\infty}^{\infty} x^2(m)h_0(n-m), \tag{2.30}$$

where $h_0(n)$ is the STE window function. Since the energy $E_n$ must be
non-negative for all real sequences x(n), including $x(n)=\delta(n-n_0)$ for any
integer $n_0$, the STE window function must be non-negative; ie., $h_0(n)\ge 0$
for all n. Note that the set of GSTFT window functions $h_k(n)$,
k=1,2,...,K, need not be non-negative in general.

A  block  diagram  of  the  STE  computation  is  shown  in  Fig.  2.13.  A 
comparison  of  Figs.  2.8  and  2.13  reveals  that  the  STFT  (or  GSTFT) 
magnitude  squared  essentially  computes  the  STE  within  a  given  frequency 
band. 


The STE can be computed recursively if $h_0(n)$ has a rational
z-transform:

$$E_n = \sum_{\psi=1}^{\Psi_0} p_0(\psi)E_{n-\psi} + \sum_{r=I}^{R_0} q_0(r)x^2(n-r). \tag{2.31}$$


For example, let

$$h_0(n) = \begin{cases} (\alpha_0 T)^3 n^2 e^{-\alpha_0 nT}, & 0 \le n \\ 0, & \text{otherwise}. \end{cases} \tag{2.32}$$

Choosing $\alpha_0=827$ results in a lowpass filter with a 3dB bandwidth of 67
Hz. The set of coefficients $p_0$ and $q_0$ are defined by Equations 2.28 with
k=0.

2.6  MINIMUM  SAMPLING  RATES 


Although  data  reduction  is  not  an  essential  part  of  an  auditory 
model,  it  is  often  desirable  to  reduce  the  amount  of  data  from  speech 
analysis  systems  for  practical  purposes.  A  modest  amount  of  data 
reduction  can  be  achieved  by  sampling  the  STE  and  F/D  bank  outputs.  If 
desired,  the  original  outputs  can  be  approximately  recovered  by  passing 
the  samples  through  an  appropriate  smoothing  filter.  A  smoothing  filter 
with  positive  impulse  response  (see  Section  B.2.3)  can  be  used  to  ensure 
that  the  upsampled  smoothed  data  is  always  positive.  This 
downsampling/upsampling  approach  is  also  known  as  decimation  and 
interpolation  (Rabiner  and  Schafer  [3]). 

The output of each F/D subsystem is bandlimited to twice the window
function bandwidth for that subsystem, so each output must be sampled at
a rate which is greater than four times the corresponding window function
bandwidth. The STE must be sampled at a rate which is greater than twice
the STE window function bandwidth. Each output may, in general, be
sampled at a different rate.


Since  the  F/D  subsystems  are  implemented  via  the  GSTFT  magnitude 
squared,  the  sampling  rates  for  F/D  subsystem  outputs  also  apply  to  GSTFT 
magnitude  squared  functions.  Once  the  GSTFT  magnitude  squared  has  been 
sampled,  any  invertible  operation  such  as  square  root  or  logarithm  can  be 
applied  to  the  data.  For  non-negative  numbers,  knowledge  of  the  square 
root  or  logarithm  of  a  number  is  the  same  as  knowledge  of  the  number 
itself.  It  follows  from  the  results  of  Sections  B.3.6  and  B.3.8  that 
GSTFT  magnitude  (as  opposed  to  magnitude  squared)  functions  are  not 
bandlimited  in  general.  The  minimum  sampling  rate  is  therefore 
determined  by  the  magnitude  squared  functions,  but  is  equally  applicable 
to  magnitude  or  log  magnitude  functions,  even  though  such  functions  may 
not  be  bandlimited. 

Note  that  the  minimum  sampling  rate  requirements  were  derived  from 
system  theory  considerations,  and  conditions  for  reconstruction  of  the 
original  signal  from  the  GSTFT  magnitude  data  are  not  considered  in  this 
chapter  (see  Chapter  3).  The  minimum  sampling  rate  arises  when  each 
channel  is  examined  independently,  and  a  sampling  rate  is  determined 
which  accurately  preserves  all  available  information  in  each  channel. 
When  the  complete  analysis  system  is  considered,  however,  channels  may 
overlap  and  contain  redundant  information.  The  overall  sampling  rate 
required  for  signal  reconstruction  may  therefore  be  less  than  the  product 
of  the  number  of  channels  and  the  sampling  rate  per  channel  determined 


2.7  CONCLUSION 


In  this  chapter,  it  was  shown  that  the  peripheral  auditory  system 
can  be  roughly  modeled  as  a  F/D  bank.  To  obtain  a  speech  analysis  system 
based  on  perception,  a  model  structure  determined  by  physiological  data 
from  animals  was  combined  with  model  parameters  determined  by  perceptual 
experiments  performed  on  humans.  It  was  shown,  via  a  new  relationship, 
that  a  F/D  subsystem  of  the  desired  type  can  be  implemented  using  the 
STFT  magnitude  squared.  Further  applications  of  this  relationship  are 
described  in  Appendix  D.  A  generalized  version  of  the  STFT  magnitude 
squared  was  used  to  implement  the  speech  analysis  system  based  on  the 
simplified  auditory  model,  and  a  STE  function  was  also  computed.  Minimum 
sampling  rates  for  the  STE  and  F/D  outputs  have  been  specified. 


CHAPTER  3 


SPEECH  SYNTHESIS  SYSTEM 


3.1  INTRODUCTION 

This  chapter  describes  a  speech  synthesis  system  which  reconstructs 
a  signal  from  spectral  magnitude  data,  as  provided  by  the  analysis  system 
of  Chapter  2.  Apart  from  an  overall  sign  factor,  the  synthesis  system 
can  obtain  exact  signal  reconstruction  in  the  absence  of  data 
modification.  It  will  be  shown  in  Chapter  4  that  the  system  also 
performs well given modified data. Only the discrete-time case will be
discussed.


The  overall  speech  analysis/synthesis  system  is  depicted  in  Fig. 
3.1.  The  signal  x(n)  is  analyzed  by  a  Filter/Detector  (F/D)  bank,  which 
is  implemented  via  the  Generalized  Short-Time  Fourier  Transform  (GSTFT) 
magnitude  squared  as  described  in  Chapter  2.  An  optional  Short-Time 
Energy  (STE)  constraint  may  also  be  computed.  The  GSTFT  magnitude 

squared  and  STE  values  are  subjected  to  an  analysis  transformation  A. 
The  analysis  transformation  may  consist  of  lowpass  filtering, 
downsampling,  logarithmic  operations,  or  a  variety  of  processes  such  as 
principal  components  analysis  (Chu  [16]).  The  analysis  transformation 
may  also  include  a  delay  in  each  channel  which  allows  the  impulse 
responses  of  Fig.  2.11  to  attain  their  peak  values  simultaneously.  Such 
delays  are  useful  for  data  display  purposes.  The  resulting  data  is  sent 
through  a  transmission  channel.  At  the  channel  output,  received  values 
are  subjected  to  a  synthesis  transformation  S.  The  synthesis 
transformation  may  consist  of  exponentiation,  upsampling,  lowpass 
filtering,  or  other  operations.  It  will  be  assumed  that  the  synthesis 
transformation  attempts  to  reverse  effects  of  the  analysis 
transformation. Thus, the synthesis transformation produces modified
data values which approximate the original values, ie.,
$|\tilde{X}_n(e^{j\omega_k})|^2 \approx |X_n(e^{j\omega_k})|^2$ and $\tilde{E}_n \approx E_n$. Finally, a sequence $\hat{x}(n)$ is
reconstructed from the modified data. Let $\hat{X}_n(e^{j\omega_k})$ denote the GSTFT and
$\hat{E}_n$ denote the STE of $\hat{x}(n)$. The sequence $\hat{x}(n)$ may be reconstructed by
choosing values so that $|\hat{X}_n(e^{j\omega_k})|^2$ and $\hat{E}_n$ match the available data
$|\tilde{X}_n(e^{j\omega_k})|^2$ and $\tilde{E}_n$. In this sense, the reconstructed signal $\hat{x}(n)$
approximates the original signal x(n). The reconstruction process is
illustrated in Fig. 3.2. Note that the reconstruction process contains a
model of the analysis system. Signal generation is accomplished by a set
of equations which will be derived in Section 3.3.2.

Figure 3.1: Overall Speech Analysis/Synthesis System


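[Figure 3.2: Reconstruction process. Block labels: SIGNAL GENERATION; |GSTFT| and STE; estimates $|\hat{X}_n(e^{j\omega_k})|^2$ and $\hat{E}_n$; data $|\tilde{X}_n(e^{j\omega_k})|^2$ and $\tilde{E}_n$.]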


3.2  ANALYSIS/SYNTHESIS  SYSTEM  DESIGN  GUIDELINES 


Although  a  signal  can  theoretically  be  recovered  from  the  unmodified 
GSTFT  magnitude  squared  (as  will  be  shown  in  Section  3.3),  several 
guidelines  must  be  applied  to  design  a  practical  analysis/synthesis 
system.  These  guidelines  are  a  consequence  of  the  F/D  and  GSTFT  magnitude 
squared  equivalence  described  in  Chapter  2. 


3.2.1  SHORT-TIME  ENERGY 

Under  certain  conditions,  unwanted  out-of-band  components  may  be 
introduced  by  the  reconstruction  process  if  STE  is  not  used.  To 
illustrate,  let  the  F/D  bank  analyzer  of  Fig.  3.1  examine  the  200-3675  Hz 
frequency region. Information about other frequency components of x(n)
is not transmitted through the channel. Assume $|\hat{X}_n(e^{j\omega_k})|^2$ exactly
matches the data $|\tilde{X}_n(e^{j\omega_k})|^2$. For reconstruction based on magnitude
information alone, nothing prevents the reconstructed signal $\hat{x}(n)$ from
having large components at low frequencies (below 200 Hz) or high
frequencies (above 3675 Hz). Such components could be eliminated by
bandpass filtering $\hat{x}(n)$, but the reconstruction process must then employ
a  wide  dynamic  range  to  maintain  a  small  signal  with  an  arbitrarily  large 
offset. 

Although  there  are  many  ways  to  eliminate  out-of-band  components 
from the reconstructed signal, use of STE has proven most practical. As
long as the original signal x(n) has been bandpass filtered to reject
components outside the F/D bank analysis range, the STE constraint
prevents out-of-band components from entering the reconstruction process.
In cases where little data modification is involved, the STE constraint
is  unnecessary  if  some  information  about  out-of-band  components  is 
allowed  through  the  transmission  channel.  This  occurs  when  non-ideal 
bandpass  filters  are  used  (refer  to  Fig.  2.12  for  examples  of  appropriate 
filter  characteristics).  Thus,  it  is  possible  to  achieve  exact  signal 
reconstruction  without  STE  (see  Anderson  and  Searle  [32]  for  examples). 
For  most  practical  applications,  however,  use  of  STE  is  recommended. 


3.2.2  BANDPASS  FILTER  CHARACTERISTICS 

If  the  bandpass  filter  frequency-domain  characteristics  do  not  meet 
certain  overlap  and  shape  requirements,  then  practical  signal 
reconstruction  is  impossible.  For  example,  consider  a  bank  consisting  of 
two  F/D  subsystems.  Let  the  bandpass  filters  have  non-overlapping 
frequency  characteristics  as  shown  in  Fig.  3.3a.  Assume  that  a 
sinusoidal  tone  burst  in  the  350-400  Hz  frequency  range  is  fed  into  the 
F/D  bank,  and  the  tone  burst  is  of  sufficient  duration  that  the  F/D 
outputs  reach  a  steady-state  value.  In  the  steady-state  condition,  one 
F/D  output  is  a  constant  positive  value  while  the  other  is  essentially 
zero  (see  Section  B.3.7).  The  amplitude  and  frequency  of  the  tone  burst, 
which  are  two  independent  parameters  of  interest,  cannot  be  determined 
from the single non-zero F/D output even if the filter characteristics
are known. Steady-state F/D outputs will be identical for a variety of
input amplitudes and frequencies. It can only be determined that the
sinusoidal input frequency lies within a particular filter passband, and
the amplitude is indeterminate. In theory, exact signal reconstruction
can  be  achieved  from  unmodified  F/D  outputs  if  transient  as  well  as 
steady-state  values  are  examined.  However,  slight  modifications  (such  as 
truncation  error)  are  always  present  in  actual  systems,  and  practical 
reconstruction  cannot  be  achieved  from  F/D  banks  using  non-overlapped 
bandpass  filters. 

When  overlapping  bandpass  filter  characteristics  are  used,  as  shown 
in  Fig.  3.3b,  amplitude  and  frequency  of  an  input  sinusoid  can  easily  be 
determined  from  the  steady-state  F/D  outputs.  For  example,  the  frequency 
can  be  obtained  from  a  ratio  of  the  F/D  outputs.  The  amplitude  can  then 
be  determined  from  either  bandpass  filter  characteristic.  Thus,  it  is 
not  necessary  to  rely  on  transient  or  low-level  components  to  achieve 
reconstruction  when  overlapped  filters  are  used.  The  need  for  overlapped 
filters in speech analysis systems has also been noted by Klatt [33].
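The ratio argument can be made concrete with a small sketch. It rests on several assumptions that are not stated in the text: the steady-state F/D output for a sinusoid of amplitude A is modeled as $(A^2/4)$ times the squared window response at the offset from the channel center (the contribution of the negative-frequency image and any constant factor are neglected), the windows have a triple pole as in Section 2.4.5, and the tone is known to lie between the two center frequencies. The channel parameters and function names are hypothetical.

```python
import math

def gain(delta_hz, alpha):
    """Squared magnitude of 2*alpha^3/(s+alpha)^3 at an offset of delta_hz."""
    w = 2.0 * math.pi * delta_hz
    return 4.0 * alpha ** 6 / (w * w + alpha * alpha) ** 3

def fd_output(amp, f0, fc, alpha):
    """Modeled steady-state output of one channel for amp*cos(2*pi*f0*t)."""
    return 0.25 * amp * amp * gain(f0 - fc, alpha)

def estimate_tone(v1, v2, fc1, fc2, alpha1, alpha2):
    """Recover (frequency, amplitude) of a tone lying between fc1 and fc2."""
    target = v1 / v2                      # the ratio does not depend on amplitude
    lo, hi = fc1, fc2
    for _ in range(100):                  # bisection on the monotone ratio
        mid = 0.5 * (lo + hi)
        ratio = gain(mid - fc1, alpha1) / gain(mid - fc2, alpha2)
        if ratio > target:                # ratio falls as frequency rises
            lo = mid
        else:
            hi = mid
    f0 = 0.5 * (lo + hi)
    amp = math.sqrt(4.0 * v1 / gain(f0 - fc1, alpha1))
    return f0, amp
```

With non-overlapping ideal passbands one of the two outputs would be essentially zero and the ratio would carry no frequency information, which is the situation described for Fig. 3.3a.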

It  should  be  noted  that  bandpass  filter  overlap  is  necessary,  but 
not  sufficient,  for  a  practical  analysis/synthesis  system.  For  example, 
if  the  bandpass  filters  are  overlapped  but  possess  both  constant  passband 
gain  and  steep  skirts,  then  steady-state  F/D  outputs  will  be  identical 
for  a  range  of  tone  burst  frequencies.  The  speech  analysis  system  based 
on  perception  uses  overlapped  filters  which  do  not  have  constant  passband 
gain  (see  Fig.  2.12).  It  will  be  demonstrated  in  Chapter  4  that  such  a 
configuration  performs  well  in  a  practical  analysis/synthesis  system. 


3.2.3  TRANSMISSION  CHANNEL  DATA  RATE 


Since  the  transmission  channel  data  rate  (see  Fig.  3.1)  must  be 
chosen  in  accordance  with  the  bandpass  filter  characteristics,  this  rate 
is affected by the filter overlap requirement. For example, assume that
the original signal is sampled at a 10 KHz rate, and a non-overlapped
bank of bandpass filters is used to cover the full 0-5 KHz frequency
range. Each F/D subsystem output must be sampled at a rate which is
twice the associated bandpass filter bandwidth (see Section 2.6), leading
to  an  overall  data  rate  of  10  KHz  (not  including  STE).  When  an 
overlapped  filter  bank  is  used,  the  required  minimum  transmission  channel 
data  rate  is  doubled  (ie.,  20  KHz).  Of  course,  if  the  full  range  of 
possible  frequencies  is  not  covered  by  the  F/D  bank,  then  the  required 
transmission  channel  data  rate  is  correspondingly  less. 

It  should  be  noted  that  the  transmission  channel  data  rate  discussed 
in  this  section  is  based  on  system  theory  considerations,  and  does  not 
consider  the  possibility  of  efficient  waveform  encoding  to  achieve  data 
reduction.  Data  reduction  is  discussed  in  Section  5.3. 


3.3  GENERAL  SYNTHESIS  EQUATIONS 


In  this  section,  it  is  shown  that  (apart  from  an  overall  sign 
factor)  right-sided  sequences  can  be  exactly  reconstructed  from  the  GSTFT 
magnitude  squared.  Left-sided  sequences  can  similarly  be  reconstructed 
when  appropriate  initial  conditions  are  specified.  The  algorithms 
presented  in  this  section  are  theoretically  capable  of  performing  signal 
reconstruction  whether  or  not  the  practical  guidelines  of  Section  3.2  are 
followed.  Thus,  in  order  to  obtain  a  practical  analysis/synthesis  system, 
the  guidelines  of  Section  3.2  are  a  prerequisite  to  application  of  the 
reconstruction  algorithms.  Note  that  synthesis  equations  of  a  general 
form  are  derived  in  this  section.  The  procedure  by  which  these  equations 
are  applied  to  a  specific  case  is  described  in  Section  3.4. 


3.3.1  PLAUSIBILITY  ARGUMENT 


A simple approach described by Nawab, Quatieri, and Lim [34] can be
used  to  recover  a  sequence  from  its  GSTFT  magnitude  squared.  Although 
this  approach  does  not  employ  the  reconstruction  process  depicted  in  Fig. 
3.2,  it  serves  to  illustrate  the  issues  involved  in  signal  reconstruction 
from  magnitude  and  to  motivate  the  practical  approach  presented  in 
Section  3.3.2.  STE  will  not  be  used  in  this  section. 

Assume that two different F/D subsystems are implemented via the
GSTFT and each Finite-duration Impulse Response (FIR) window function
$h_k(n)$, k=1 or 2, is nonzero only for $0\le n\le M_k-1$. It follows from Equation
2.22 that the GSTFT magnitude squared can be written as:

$$|X_n(e^{j\omega_k})|^2 = a_k x^2(n) + b_k(n)x(n) + c_k(n), \tag{3.1}$$

where

$$a_k = [h_k(0)]^2, \tag{3.2}$$

$$b_k(n) = 2h_k(0)\sum_{m=1}^{M_k-1} x(n-m)h_k(m)\cos(\omega_k m), \tag{3.3}$$

and

$$c_k(n) = \Big|\sum_{m=1}^{M_k-1} x(n-m)h_k(m)e^{j\omega_k m}\Big|^2. \tag{3.4}$$

Therefore,

$$x(n) = (1/2a_k)\Big\{-b_k(n) \pm \sqrt{[b_k(n)]^2 - 4a_k[c_k(n) - |X_n(e^{j\omega_k})|^2]}\Big\}. \tag{3.5}$$


Note  that  care  must  be  taken  to  ensure  the  quantity  under  the  square  root 
sign  is  always  positive. 

To illustrate the signal reconstruction process, assume x(n)=0 for
n<0. It follows that $b_k(0)=c_k(0)=0$, and

$$x(0) = \pm|X_0(e^{j\omega_k})|/h_k(0). \tag{3.6}$$

Thus the output from either F/D subsystem may be used to determine the
first reconstructed value within a sign factor. The positive value for
x(0) may be arbitrarily chosen, as choice of the negative value only
changes the reconstructed sequence by an overall sign factor. Given the
value of x(0), values of $b_k(1)$ and $c_k(1)$ can be computed. Note that
$b_k(n)$ and $c_k(n)$ are always computed using previously reconstructed signal
values. Given $|X_1(e^{j\omega_k})|^2$ for two F/D subsystems appropriately spaced in
frequency, the value of x(1) can be determined using Equation 3.5. Each
of the two F/D subsystems yields two possible values for x(1), and the
ambiguity is resolved by choosing the solution which is consistent with
both F/D outputs. Given x(0) and x(1), the value for x(2) can be
determined, and so forth to reconstruct the entire sequence.
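The sequential procedure can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of Nawab, Quatieri, and Lim as published: the data are unmodified, x(0) is nonzero, both FIR windows have a nonzero first sample, and the "consistent" solution is taken to be the pair of roots (one per channel) that lie closest together, averaged. The function names and the window values in the usage note are hypothetical.

```python
import cmath
import math

def gstft_mag_sq(x, h, w):
    """GSTFT magnitude squared of a right-sided x for an FIR window h."""
    out = []
    for n in range(len(x)):
        acc = 0j
        for m, hm in enumerate(h):
            if n - m >= 0:
                acc += x[n - m] * hm * cmath.exp(-1j * w * (n - m))
        out.append(abs(acc) ** 2)
    return out

def candidates(xhat, n, h, w, d):
    """Both roots of the quadratic in x(n) for one channel, from x(0..n-1)."""
    s = 0j
    for m in range(1, len(h)):
        if n - m >= 0:
            s += xhat[n - m] * h[m] * cmath.exp(1j * w * m)
    a = h[0] ** 2                                  # a_k
    b = 2.0 * h[0] * s.real                        # b_k(n)
    c = abs(s) ** 2                                # c_k(n)
    disc = max(b * b - 4.0 * a * (c - d), 0.0)     # guard against rounding
    root = math.sqrt(disc)
    return ((-b + root) / (2.0 * a), (-b - root) / (2.0 * a))

def reconstruct(d1, d2, h1, h2, w1, w2):
    """Rebuild x(n), up to an overall sign, from two magnitude-squared channels."""
    xhat = [math.sqrt(d1[0]) / abs(h1[0])]         # positive sign chosen for x(0)
    for n in range(1, len(d1)):
        r1 = candidates(xhat, n, h1, w1, d1[n])
        r2 = candidates(xhat, n, h2, w2, d2[n])
        u, v = min(((u, v) for u in r1 for v in r2),
                   key=lambda pair: abs(pair[0] - pair[1]))
        xhat.append(0.5 * (u + v))
    return xhat
```

For example, with `h1 = [1.0, 0.3, 0.1]`, `h2 = [1.0, 0.25, 0.15]`, analysis frequencies 0.6 and 1.4 rad/sample, and a short test sequence with a positive first sample, `reconstruct` should return the original sequence to within rounding error. With perturbed data the two channels generally disagree, and the closest-pair rule is only a crude stand-in for a proper error criterion.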

This  simple  reconstruction  algorithm  is  subject  to  many  practical 
difficulties. First of all, the reconstructed sequence may not be
unique. Recall that the window functions $h_k(n)$, k=1 or 2, are nonzero for
$0\le n\le M_k-1$ (otherwise, uniqueness problems may be caused by "gaps" in the
window function, as described by Nawab [35]). The reconstructed sequence
is unique, to within an overall sign factor, unless a sequence of zero


values  having  length  or  more  is  encountered  in  the  data.  A  sign 
ambiguity  is  introduced  whenever  such  a  sequence  of  zeros  is  encountered. 
For example, Fig. 3.4 shows four possible reconstructed sequences which
can result when the two window functions are of length four or less.
Studies suggest that such effects may not be important for speech if the
analysis uses at least two F/D subsystems with impulse response duration
of 10 milliseconds or more (Warren and Wrightson [36]; Flanagan and
Guttman [37]). In any case, the multiple sign ambiguity problem can be
alleviated by use of Infinite-duration Impulse Response (IIR) windows.

Another problem with the simple algorithm is its inability to
perform reconstruction from modified data. Slight modifications such as
truncation error can cause the two F/D outputs to produce contradictory
results. For example, if data from one F/D indicates that $x(1)=-2.000$ or
$.919$ while another F/D indicates $x(1)=-2.002$ or $2.332$, then no
consistent solution for the value of $x(1)$ exists. It may be desirable,
however, to use the value $x(1)=-2.001$ for future computations, although
this value must be chosen by some algorithm which processes inconsistent
results. Such inconsistent results can be treated in an organized manner
by defining an error criterion, as will be shown in Section 3.3.2. The
error criterion is used to determine the choice of an appropriate real
value.

Next, note that the simple algorithm reconstructs $x(n)$ given
$|X_n(e^{j\omega_k})|^2$, ignoring information about $x(n)$ contained in the
future data $|X_m(e^{j\omega_k})|^2$ for $m=n+1,\ldots,n+M_k-1$. This
observation suggests that an algorithm using non-causal processing, such
as filtering with delay, may achieve superior results. For example,
consider the reconstruction process of Fig. 3.2 which uses an error
criterion. Assume that, once reconstructed, the value of a point is held
constant. The feedback system of Fig. 3.2 reduces the error by changing
only the value of $\hat{x}(n)$ at one specific time $n$. If previous points
were not reconstructed exactly, the system attempts to compensate by
changing the value of $\hat{x}(n)$ accordingly. Such a change may lead to
further cumulative error, causing poor reconstruction. However, if
previously reconstructed values can be modified on the basis of new
information, the error can be distributed among a large number of points
and reconstruction is improved. This non-causal approach is especially
useful for reconstruction from modified spectra.

Note  that  since  the  simple  reconstruction  algorithm  achieves  exact 
reconstruction  from  only  two  F/D  channel  outputs,  an  overall  data  rate 
which  is  at  least  twice  the  sampling  rate  of  x(n)  can  be  used  in  the 
transmission  channel  of  Fig.  3.1.  This  result  is  the  same  as  that  derived 
in  Section  3.2.3.  For  data  rates  less  than  twice  the  sampling  rate,  exact 
reconstruction  cannot  be  achieved  in  general  (see  Theorem  2.3  of  Nawab 


3.3.2  EQUATIONS  FOR  PRACTICAL  SIGNAL  RECONSTRUCTION

The  simple  reconstruction  algorithm  of  Section  3.3.1  can  be  modified 
to  obtain  the  practical  algorithm  shown  in  Fig.  3.2.  The  required 
modifications  include  use  of  an  error  criterion  and  non-causal 
processing. 

To develop a practical algorithm, the GSTFT is rewritten in a more
convenient form. For any integers $\ell$ and $\gamma$ it follows from
Equation 2.22 that:

$$\hat{X}_{n-\ell}(e^{j\omega_k}) = \hat{x}(n-\gamma)\,h_k(\gamma-\ell)\,e^{-j\omega_k(n-\gamma)} + \sum_{m\ne\gamma-\ell} \hat{x}(n-\ell-m)\,h_k(m)\,e^{-j\omega_k(n-\ell-m)}, \tag{3.7}$$

where the summation over $m\ne\gamma-\ell$ is defined as the summation from
minus infinity to $\gamma-\ell-1$, plus the summation from $\gamma-\ell+1$ to
infinity. Taking the magnitude squared of Equation 3.7 yields:

$$|\hat{X}_{n-\ell}(e^{j\omega_k})|^2 = \hat{a}_k\,\hat{x}^2(n-\gamma) + \hat{b}_k(n)\,\hat{x}(n-\gamma) + \hat{c}_k(n), \tag{3.8}$$

where

$$\hat{a}_k = [h_k(\gamma-\ell)]^2, \tag{3.9}$$

$$\hat{b}_k(n) = 2h_k(\gamma-\ell) \sum_{m\ne\gamma-\ell} \hat{x}(n-\ell-m)\,h_k(m)\cos[\omega_k(m-\gamma+\ell)], \tag{3.10}$$

and

$$\hat{c}_k(n) = \left|\sum_{m\ne\gamma-\ell} \hat{x}(n-\ell-m)\,h_k(m)\,e^{j\omega_k m}\right|^2. \tag{3.11}$$

The GSTFT magnitude squared at any time $n-\ell$ can thus be expressed as a
quadratic function of the sequence at any time $n-\gamma$. Note that
$\hat{a}_k$, $\hat{b}_k(n)$, and $\hat{c}_k(n)$ are independent of
$\hat{x}(n-\gamma)$. However, if a value of $\hat{x}(n-\gamma)$ and its
corresponding value of $\hat{X}_{n-\ell}(e^{j\omega_k})$ are known, then it is
easily verified from Equations 3.7 and 3.10 that:

$$\hat{b}_k(n) = 2h_k(\gamma-\ell)\left(\mathrm{Re}\{e^{j\omega_k(n-\gamma)}\,\hat{X}_{n-\ell}(e^{j\omega_k})\} - \hat{x}(n-\gamma)\,h_k(\gamma-\ell)\right). \tag{3.12}$$

Also, it follows from Equation 3.8 that:

$$\hat{c}_k(n) = |\hat{X}_{n-\ell}(e^{j\omega_k})|^2 - \hat{a}_k\,\hat{x}^2(n-\gamma) - \hat{b}_k(n)\,\hat{x}(n-\gamma). \tag{3.13}$$

Equations  3.12  and  3.13  can  often  be  computed  more  easily  than  Equations 
3.10  and  3.11. 
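
As a numerical sanity check, the two routes to the coefficients can be compared directly. The sketch below is illustrative only: it assumes an FIR window, a sequence that is zero for negative time, and the GSTFT convention of Equation 3.7; the window values, the sequence, and the helper names `gstft`, `coeffs_direct`, and `coeffs_fast` are hypothetical choices made for the demonstration.

```python
import numpy as np

def gstft(x_hat, n, h, w):
    """X_n(e^{jw}) = sum_m x(n-m) h(m) exp(-jw(n-m)), with x(n) = 0 for n < 0."""
    return sum(x_hat[n - m] * h[m] * np.exp(-1j * w * (n - m))
               for m in range(len(h)) if n - m >= 0)

def coeffs_direct(x_hat, n, ell, gamma, h, w):
    """Eqs. 3.9-3.11: sums over every m except m = gamma - ell."""
    skip = gamma - ell
    s_cos, s_exp = 0.0, 0.0j
    for m in range(len(h)):
        if m == skip or n - ell - m < 0:
            continue
        term = x_hat[n - ell - m] * h[m]
        s_cos += term * np.cos(w * (m - gamma + ell))
        s_exp += term * np.exp(1j * w * m)
    return h[skip] ** 2, 2 * h[skip] * s_cos, abs(s_exp) ** 2

def coeffs_fast(x_hat, n, ell, gamma, h, w):
    """Eqs. 3.12-3.13: reuse a GSTFT value that is already available."""
    hg, xg = h[gamma - ell], x_hat[n - gamma]
    X = gstft(x_hat, n - ell, h, w)
    b = 2 * hg * ((np.exp(1j * w * (n - gamma)) * X).real - xg * hg)
    c = abs(X) ** 2 - hg ** 2 * xg ** 2 - b * xg
    return hg ** 2, b, c

# Assumed demonstration data (not taken from this work).
h = np.array([0.2, 0.7, 1.0, 0.6, 0.3])
x_hat = np.array([0.5, -1.0, 0.3, 0.8, -0.2, 0.4, 0.9, -0.7])
direct = coeffs_direct(x_hat, n=6, ell=1, gamma=3, h=h, w=0.9)
fast = coeffs_fast(x_hat, n=6, ell=1, gamma=3, h=h, w=0.9)
print(np.allclose(direct, fast))
```

The second route needs only one complex multiplication per channel once the GSTFT estimate is known, which is the sense in which it can be computed more easily.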

Using a similar approach for STE, it follows from Equation 2.30
that:

$$\hat{E}_{n-\ell} = \hat{a}_0\,\hat{x}^2(n-\gamma) + \hat{c}_0(n), \tag{3.14}$$

where

$$\hat{a}_0 = h_0(\gamma-\ell), \tag{3.15}$$

and

$$\hat{c}_0(n) = \sum_{m\ne\gamma-\ell} h_0(m)\,\hat{x}^2(n-\ell-m). \tag{3.16}$$

Therefore,

$$\hat{c}_0(n) = \hat{E}_{n-\ell} - \hat{a}_0\,\hat{x}^2(n-\gamma). \tag{3.17}$$

For convenience, a weighted mean squared error criterion is chosen.
The error is defined as:

$$e(n) = \sum_{\ell=-\infty}^{\infty} \left\{ (\hat{E}_{n-\ell} - E_{n-\ell})^2\,W_0(\ell) + \sum_{k=1}^{K} \left[\,|\hat{X}_{n-\ell}(e^{j\omega_k})|^2 - |X_{n-\ell}(e^{j\omega_k})|^2\,\right]^2 W_k(\ell) \right\}, \tag{3.18}$$

where $W_m(\ell)$, $m=0,1,\ldots,K$, and $-\infty<\ell<\infty$, is a weighting
function. The weighting function specifies which data values contribute
to the error at time $n$. Although weighting functions which vary with
time or signal level can be used, such functions will not be considered
here. When the weighting function is constant for all values of $\ell$,
the error is $e(n)=e_{total}$, where $e_{total}$ is a constant total error
independent of $n$. Any reconstructed sequence $\hat{x}(n)$ which minimizes
$e_{total}$ is an "optimum" solution in the mean squared error sense. In
order to achieve reasonable results with less computation, a "local"
sub-optimum error criterion may be preferable to the "global" optimum
error criterion. A local error criterion can be obtained by choosing a
weighting function which is narrow in the $\ell$ dimension. Even when a
sub-optimum error criterion is chosen, error minimization may require
solution of an infinite number of simultaneous nonlinear equations if the
window functions are infinitely long. It is not generally possible to
solve such a set of equations with a finite amount of computation.

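A practical sub-optimum approach to signal reconstruction, which
avoids the problem of solving simultaneous nonlinear equations, can be
obtained by holding all synthesized values constant with the exception of
$\hat{x}(n-\gamma)$. An appropriate value of $\hat{x}(n-\gamma)$ can then be
determined by substituting Equations 3.8 and 3.14 into 3.18 and setting
$\partial e(n)/\partial\hat{x}(n-\gamma)=0$. Under these conditions, $e(n)$
is reduced by choosing $\hat{x}(n-\gamma)$ as a root of:

$$u_3\,\hat{x}^3(n-\gamma) + u_2\,\hat{x}^2(n-\gamma) + u_1\,\hat{x}(n-\gamma) + u_0 = 0, \tag{3.19}$$

where

$$u_3 = 2\sum_{\ell=-\infty}^{\infty} \left[ (\hat{a}_0)^2\,W_0(\ell) + \sum_{k=1}^{K} (\hat{a}_k)^2\,W_k(\ell) \right], \tag{3.20}$$

$$u_2 = 3\sum_{\ell=-\infty}^{\infty} \sum_{k=1}^{K} \hat{a}_k\,\hat{b}_k(n)\,W_k(\ell), \tag{3.21}$$

$$u_1 = 2\sum_{\ell=-\infty}^{\infty} \hat{a}_0\,[\hat{c}_0(n) - E_{n-\ell}]\,W_0(\ell) + \sum_{\ell=-\infty}^{\infty} \sum_{k=1}^{K} \left\{ [\hat{b}_k(n)]^2 + 2\hat{a}_k\left(\hat{c}_k(n) - |X_{n-\ell}(e^{j\omega_k})|^2\right) \right\} W_k(\ell), \tag{3.22}$$

and

$$u_0 = \sum_{\ell=-\infty}^{\infty} \sum_{k=1}^{K} \hat{b}_k(n)\left[\hat{c}_k(n) - |X_{n-\ell}(e^{j\omega_k})|^2\right] W_k(\ell). \tag{3.23}$$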

Values for the reconstructed sequence can be generated by solving
Equation 3.19 for the roots of a cubic expression. The real root which
yields the smallest value of $e(n)$ is chosen as the sequence value. Since
cubics have one, two, or three distinct real roots, a real sequence value
can always be found which satisfies the error criterion. Furthermore, a
closed-form solution exists for computing the roots (CRC Standard
Mathematical Tables [38]). In the actual implementation, double-precision
computer arithmetic was used to obtain an accurate solution of the cubic
expression. Such accuracy, however, is not required elsewhere in the
reconstruction algorithm.
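
A minimal sketch of this root-selection step is given below. It is not the implementation used in this work: it assumes a weighting function that is nonzero only at $\ell=0$ with unit F/D weights, it uses NumPy's general polynomial root finder in place of the closed-form cubic solution, and the function name `refine_point` and the numerical values in the demonstration are hypothetical. The coefficients follow from differentiating Equation 3.18, after substituting Equations 3.8 and 3.14, with respect to the single free sample.

```python
import numpy as np

def refine_point(a0, c0, E, a, b, c, X2, W0=0.03):
    """Solve Eq. 3.19 and return the real root with the smallest error.

    a, b, c, X2 are arrays over the K channels (W_k = 1 at ell = 0 only);
    a0, c0, E are the corresponding STE quantities.
    """
    u3 = 2 * (a0 ** 2 * W0 + np.sum(a ** 2))                      # Eq. 3.20
    u2 = 3 * np.sum(a * b)                                        # Eq. 3.21
    u1 = 2 * a0 * (c0 - E) * W0 + np.sum(b ** 2 + 2 * a * (c - X2))  # Eq. 3.22
    u0 = np.sum(b * (c - X2))                                     # Eq. 3.23
    roots = np.roots([u3, u2, u1, u0])
    real_roots = roots[np.isclose(roots.imag, 0.0, atol=1e-9)].real

    def err(x):
        # Error of Eq. 3.18 restricted to ell = 0, as a function of the sample.
        return ((a0 * x * x + c0 - E) ** 2 * W0
                + np.sum((a * x * x + b * x + c - X2) ** 2))

    return min(real_roots, key=err)

# Assumed demonstration data: targets generated from a known sample value.
a = np.array([1.0, 0.25, 0.49])
b = np.array([0.3, -1.2, 0.8])
c = np.array([0.5, 1.0, 0.2])
a0, c0, x_true = 0.2, 1.5, 0.7
X2 = a * x_true ** 2 + b * x_true + c
E = a0 * x_true ** 2 + c0
x_new = refine_point(a0, c0, E, a, b, c, X2)
print(x_new)
```

Because the targets in the demonstration are consistent with a single sample value, the error vanishes at that value and the selected root recovers it; with modified data the selected root is simply the stationary point of lowest error.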


3.4  SPEECH  SYNTHESIS  PROCEDURE 


Equations  of  Section  3.3.2  can  be  used  to  reconstruct  speech  signals 
from  data  produced  by  the  analysis  system  of  Chapter  2.  Although  the 
equations  may  be  applied  in  many  different  ways,  only  one  approach  will 
be  described  in  detail.  This  approach  has  been  used  to  generate  a  number 
of  examples,  which  are  presented  in  Chapter  4. 

From Equations 2.22 and 2.27 it follows that $X_n(e^{j\omega_k})$ contains
information about $x(n-\gamma)$ for $\gamma \ge i$, where $i=1$ for the present
application. For practical purposes, however, it is assumed that
$X_n(e^{j\omega_k})$ contains significant information about $x(n-\gamma)$ only
for the finite set of values $i \le \gamma \le \gamma_{\max}$, where
$\gamma_{\max}$ is some arbitrary integer. Therefore, if the values of
$\hat{x}(n-\gamma)$ for $i \le \gamma \le \gamma_{\max}$ are changed during the
reconstruction process, then the values of $\hat{X}_{n-\ell}(e^{j\omega_k})$
for $0 \le \ell \le \gamma_{\max}-i$ must also be changed accordingly. The
value $\gamma_{\max}=20$, which results in a 2 millisecond synthesis window,
will be used throughout. Note that this value is not
critical. Small values of $\gamma_{\max}$ can be used to rapidly obtain exact
reconstruction from unmodified spectral data, while large values
commensurate with the maximum effective window length
improve the quality of reconstruction from modified data.

To completely specify the reconstruction error criterion, an
appropriate weighting function $W_m(\ell)$, $m=0,1,\ldots,K$, and
$-\infty<\ell<\infty$, must be chosen. The F/D outputs are bandlimited
functions which do not generally change rapidly. Therefore, the weighting
function can be chosen narrow in the $\ell$ dimension. A weighting function
which is wide in the $\ell$ dimension may be advantageous for
reconstruction from highly modified data, but causes an increase in
computation time and implementation complexity. Since the bandpass
filters have normalized gains as described in Section 2.4.5, and are
roughly of equal importance for speech intelligibility (Beranek [39]),
the F/D weighting coefficients are equal. Let $W_k(\ell)=1$ for $\ell=0$ and
$k=1,\ldots,K$, and $W_k(\ell)=0$ otherwise. With this choice of F/D
weighting coefficients, an empirically determined energy weight
$W_0(\ell)=.03$ for $\ell=0$, $W_0(\ell)=0$ otherwise, is appropriate. The
energy weight is small because energy values are often large, and also
because the energy function is intended as a constraint and not as an
information-bearing element. The resulting error expression is:

$$e(n) = (\hat{E}_n - E_n)^2\,W_0 + \sum_{k=1}^{K} \left[\,|\hat{X}_n(e^{j\omega_k})|^2 - |X_n(e^{j\omega_k})|^2\,\right]^2, \tag{3.24}$$

where $W_0=.03$ and $K=15$. The error given by Equation 3.24 is used for all
reconstruction examples of Chapter 4 (see Anderson and Searle [32] for
examples using a different weighting function). The total error can be
computed as:

$$e_{total} = \sum_{n=-\infty}^{\infty} e(n). \tag{3.25}$$

For comparison purposes, it is useful to define a total error which is
normalized with respect to the original signal:

$$e_{total,norm} = e_{total} \Big/ \left\{ \sum_{n=-\infty}^{\infty} \left[ (E_n)^2\,W_0 + \sum_{k=1}^{K} |X_n(e^{j\omega_k})|^4 \right] \right\}. \tag{3.26}$$

The total normalized error does not change with input signal level, and
provides a form of error-to-signal ratio (Griffin, Deadrick, and Lim
[40]).
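
The three error measures can be written compactly as array operations. The following is a hedged sketch rather than the original implementation: the array layout (time along the first axis, channels along the second), the function names `error_terms` and `total_normalized_error`, and the random demonstration data are all assumptions made for illustration.

```python
import numpy as np

def error_terms(E_hat, E, X2_hat, X2, W0=0.03):
    """e(n) of Eq. 3.24. E arrays have shape (N,); X2 arrays hold
    magnitude-squared GSTFT values with shape (N, K)."""
    return (E_hat - E) ** 2 * W0 + np.sum((X2_hat - X2) ** 2, axis=1)

def total_normalized_error(E_hat, E, X2_hat, X2, W0=0.03):
    """e_total (Eq. 3.25) divided by the normalizing sum of Eq. 3.26."""
    e_total = np.sum(error_terms(E_hat, E, X2_hat, X2, W0))
    norm = np.sum(E ** 2 * W0 + np.sum(X2 ** 2, axis=1))
    return e_total / norm

# Assumed demonstration data: random "given" values and perturbed estimates.
rng = np.random.default_rng(0)
E = rng.uniform(1.0, 2.0, size=50)
X2 = rng.uniform(0.0, 1.0, size=(50, 15))
E_hat = E + 0.01 * rng.standard_normal(50)
X2_hat = X2 + 0.01 * rng.standard_normal((50, 15))

e_total = np.sum(error_terms(E_hat, E, X2_hat, X2))
e_norm = total_normalized_error(E_hat, E, X2_hat, X2)
print(e_total, e_norm)
```

Halving the signal amplitude scales every energy and magnitude-squared value by one quarter, so the total error falls by a factor of sixteen while the normalized error is unchanged.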

The synthesis procedure will now be described in detail. For
convenience, assume $x(n)=0$ for $n<0$. The reconstructed sequence
$\hat{x}(n)$, estimated GSTFT $\hat{X}_n(e^{j\omega_k})$, and estimated STE
$\hat{E}_n$ are initially set to zero for all $n$. The synthesis procedure
begins at any time $n<i$, where $i=1$ for the present application. The index
$n$ is incremented one point at a time, and each newly reconstructed point
is used to update previously reconstructed values.

The first reconstruction step advances the time index, and updates
estimated GSTFT and STE values based on available reconstructed sequence
values. Previously calculated GSTFT and STE values which are unaffected
by any changes in $\hat{x}(n-\gamma)$, $i \le \gamma \le \gamma_{\max}$, are used
as initial conditions for the update. Equations 2.29, 2.31, and the
present values of $\hat{x}(n-\gamma)$ for $\gamma > i$ are used to generate
GSTFT and STE estimates up to time $n$.

The present estimated value of $\hat{x}(n-i)$, which was set to zero during
initialization, is likely to be in error. An improved estimate for
$\hat{x}(n-i)$ is obtained by a procedure which will be described shortly.
Improving the estimate for $\hat{x}(n-i)$ provides a reconstructed sequence
value. Using this improved value, new values for previously
reconstructed points can also be determined.


Due to the shape of the window functions, which have small initial
values as shown in Fig. 2.11, many refinements are necessary in the
estimates of $\hat{x}(n-\gamma)$ for small $\gamma$. To make refinements, all
points other than one specified point are held constant, and the
specified point is allowed to vary in a fashion which reduces the error.
Thus, adjustments to $\hat{x}(n-\gamma)$ for large $\gamma$ must not be made
until the more recently reconstructed points are thoroughly corrected.
Estimated values of the reconstructed points must therefore be refined in
a certain order. To develop the examples shown in Chapter 4, the
following order of refinement in values of $\hat{x}(n-\gamma)$ was used:

$$\begin{aligned}
\gamma ={} & 1,2,\ 1,2,\ 1,2,\ 1,2,\ 1,2, \\
& 1,2,3,\ 1,2,3,\ 1,2,3,\ 1,2,3,\ 1,2,3, \\
& \text{etc.}, \\
& 1,\ldots,7,\ 1,\ldots,7,\ 1,\ldots,7,\ 1,\ldots,7,\ 1,\ldots,7, \\
& 1,\ldots,8,\ 1,\ldots,9,\ 1,\ldots,10,\ \text{etc.},\ 1,\ldots,\gamma_{\max},
\end{aligned} \tag{3.27}$$

where $\gamma_{\max}=20$. After this procedure has been performed to
reconstruct one new point $\hat{x}(n-i)$ and adjust values of previously
reconstructed points through $\hat{x}(n-\gamma_{\max})$, the time index is
incremented, GSTFT and STE estimates are updated based on the new
sequence values, and the procedure is repeated to reconstruct the entire
sequence.
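
The refinement order of Equation 3.27 can be generated programmatically. The sketch below reflects one reading of the pattern, stated here as an assumption: each run $1,\ldots,m$ is repeated five times for $m=2,\ldots,7$ and performed once for $m=8,\ldots,\gamma_{\max}$. The function name `refinement_order` and its parameters are hypothetical.

```python
def refinement_order(gamma_max=20, repeats=5, last_repeated=7):
    """List of gamma values in the order of Eq. 3.27 (assumed reading).

    Runs 1..top are repeated `repeats` times while top <= last_repeated,
    and performed once for larger top, up to gamma_max.
    """
    order = []
    for top in range(2, gamma_max + 1):
        passes = repeats if top <= last_repeated else 1
        for _ in range(passes):
            order.extend(range(1, top + 1))
    return order

order = refinement_order()
print(len(order), order[:10], order[-5:])
```

Under this reading the most recently reconstructed points are refined far more often than older ones, which matches the requirement that recent points be thoroughly corrected before older points are adjusted.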




To refine the estimate of any point $\hat{x}(n-\gamma)$, Equations 3.12,
3.13, and 3.17 are used with $\ell=0$ to obtain $\hat{b}_k(n)$,
$\hat{c}_k(n)$, and $\hat{c}_0(n)$. Note that $\hat{a}_0$ and $\hat{a}_k$ are
pre-computed constants which do not depend on the data. The contribution
of the present sequence estimate $\hat{x}(n-\gamma)$ is now subtracted from
the present GSTFT estimates $\hat{X}_n(e^{j\omega_k})$ and STE estimate
$\hat{E}_n$ by using Equations 3.7 and 3.14. Next, $u_2$, $u_1$, and $u_0$
are computed from Equations 3.21, 3.22, and 3.23. Note that $u_3$ can be
pre-computed, as shown in Equation 3.20. Equation 3.19 is solved,
resulting in up to three new candidate estimates for $\hat{x}(n-\gamma)$.
The first candidate is evaluated by adding its contribution to the GSTFT
and STE estimates using Equations 3.7 and 3.14. The resulting error is
evaluated using Equation 3.24. A similar procedure is applied to each
remaining candidate, the one producing minimum error is chosen as the new
estimated value for $\hat{x}(n-\gamma)$, and the corresponding GSTFT and STE
estimates are retained. Note that, for a fixed time $n$, the error is
reduced with each application of this procedure.

Note that the algorithm described in this section can be applied to
reconstruction  of  right-sided  sequences  or  other  types  of  sequences  for 
which  appropriate  initial  conditions  have  been  specified.  If  necessary, 
however,  initial  conditions  may  be  generated  by  repeated  application  of 
the  reconstruction  equations  for  some  fixed  time  n.  Once  the  initial 
conditions  have  been  established,  n  is  incremented  and  the  sequence  is 
reconstructed. 

Finally,  it  is  worth  noting  that  reconstruction  can  be  performed 
directly  from  sampled  data.  For  example,  assume  that  only  every  other 
time-domain  sample  is  available  from  the  analysis.  The  synthesis  can  be 
advanced  two  time  steps,  rather  than  one  step  at  a  time,  and  smoothing 
accomplished  by  an  order  of  refinement  different  than  that  described  by 
Equation 3.27. Alternatively, a weighting function $W_m(\ell)$ which is
nonzero only for $\ell=0$ and $\ell=2$ can be used in a modified version of the
reconstruction  algorithm.  Although  these  approaches  produce  results 
comparable  to  those  produced  by  simply  smoothing  the  sampled  data  prior 
to  reconstruction,  they  require  considerably  more  computation  time  and 
are  therefore  less  practical. 


3.5  CONCLUSION 


In  this  chapter,  general  guidelines  for  practical  analysis/synthesis 
systems  have  been  established.  These  guidelines  indicate  that  STE  (or 
some  other  constraint)  must  be  used  to  prevent  out-of-band  components 
from  dominating  the  reconstructed  signal.  The  analysis  must  use  an 
overlapped  bandpass  filter  bank  in  which  the  filters  do  not  possess  both 
constant  passband  gain  and  steep  skirts.  The  speech  analysis/synthesis 
system  based  on  perception  meets  these  requirements. 

In  general,  a  transmission  channel  data  rate  which  is  twice  the 
original  sampling  rate  must  be  used  to  achieve  exact  signal 
reconstruction.  However,  if  the  F/D  bank  does  not  cover  the  full  range 
of  possible  frequencies,  then  a  lower  rate  can  be  used.  Under  these 
conditions,  only  signals  within  the  range  of  the  F/D  bank  can  be 
reconstructed. Thus, unlike other systems which require an increase in
transmission channel bandwidth when the sampling rate is increased, this
system produces results which are independent of the original signal
sampling  rate. 

The  new  signal  reconstruction  algorithm  described  in  this  chapter  is 
presently  the  only  one  known  which  is  capable  of  performing 
reconstruction  from  data  produced  by  a  critical  bandwidth  F/D  bank.  The 
algorithm is an extension of an algorithm described by Nawab et al. [34].
The extension introduces a weighted mean squared error criterion and
non-causal processing to achieve practical results. The new algorithm is
applicable to systems using both IIR and FIR analysis filters, and exact
reconstruction can be obtained in the absence of data modification. In
the absence of substantial data modification, reconstruction can be
accomplished in very little time by choosing a small value for $\gamma_{\max}$. The
algorithm  can  incorporate  measurements  of  different  types  (such  as 
Short-Time  Energy),  reconstruction  can  be  accomplished  from  a  limited 
range  of  frequencies,  and  contributions  to  error  can  be  weighted 
according  to  frequency  band  if  desired. 

The  new  algorithm  uses  a  sub-optimum  reconstruction  approach  with  a 
sub-optimum  error  criterion,  and  does  not  generally  minimize  the  total 
error $e_{total}$. However, it may not be possible to determine the optimum
solution with a finite amount of computation when infinite-length window
functions are involved. When the special case of an analysis using
uniformly spaced constant-bandwidth FIR filters spanning the full
frequency range is considered, other techniques are available which
attempt to minimize total error (e.g., Griffin and Lim [10]; Musicus
[41]). The error criterion for the constant-bandwidth case, however, is
not perception-based. Although the new algorithm presented in this
chapter does not necessarily minimize $e_{total}$, the error value $e(n)$ is
reduced  with  each  refinement  of  the  estimated  sequence  values.  Note  that 
the  error  criterion  can  be  either  local  or  global,  but  a  local  criterion 
is  used  to  simplify  the  algorithm  and  reduce  computation  time. 


CHAPTER  4 


EXAMPLES 


4.1  INTRODUCTION 

In  this  chapter,  operation  of  the  speech  analysis/synthesis  system 
based  on  perception  is  demonstrated.  Examples  of  tone  bursts,  tone  pair 
bursts,  synthetic  vowels,  and  natural  speech  signals  are  analyzed, 
subjected  to  a  short-time  spectral  modification,  and  synthesized. 
Although the analysis/synthesis system is actually implemented using a
discrete-time approach, the examples are presented as continuous-time
functions.


4.2  TONE  BURST 


Fig. 4.1 presents a 1 kHz tone burst of 32 millisecond duration.
Since  the  tone  burst  is  essentially  a  bandlimited  signal,  no 
pre-filtering  was  applied  to  suppress  components  outside  the  200-3675  Hz 
frequency  range.  The  tone  burst  was  analyzed  by  the  speech  analysis 
system  described  in  Sections  2.4.5  and  2.5.  The  resulting 
Filter/Detector  (F/D)  outputs,  which  are  computed  via  the  Generalized 
Short-Time  Fourier  Transform  (GSTFT)  magnitude  squared,  are  shown  in  Fig. 
4.2. The symbol "E" denotes Short-Time Energy (STE), and channel numbers
correspond to the filter numbers of Table 2.1. The amplitude scale of
Fig. 4.2 is logarithmic. A logarithmic scale is used in order to reveal
features which might otherwise be obscured, and to approximate perceived
loudness effects (Siebert [15]). After an initial transient, all F/D
outputs reach a steady-state value, and a final transient occurs at the
end of the tone burst. The highest value is attained in Channel 7 since
this channel has a center frequency of 1 kHz. From Fig. 2.12 it follows
that the steady-state level of Channel 1 is -55 dB and Channel 15 is -43 dB
re Channel 7. Fig. 4.2 can be re-plotted to show log amplitude as a
function  of  frequency  with  time  as  a  parameter.  Such  a  three-dimensional 
(3D)  running  spectrum  plot  is  presented  in  Fig.  4.3. 

The  reconstruction  algorithm  described  in  Section  3.4  was  applied  to 
the  data  of  Fig.  4.3,  and  the  resulting  signal  is  shown  in  Fig.  4.4.  The 
reconstructed  signal  of  Fig.  4.4  is  indistinguishable  from  the  original 
signal of Fig. 4.1, and is generally accurate to four significant
digits. Thus, the GSTFT magnitude is a complete means of signal
representation (apart from an arbitrary overall sign factor).

The reconstructed signal was analyzed, and the resulting values were
compared  to  the  values  of  Fig.  4.3  in  accordance  with  Equation  3.24.  The 
resulting  error  is  shown  in  Fig.  4.5.  This  plot  is  normalized  to  the 
peak  error  value  of  4.9x10^.  The  area  under  the  graph  of  Fig.  4.5 
corresponds to the total error, $e_{total}$. Note that the error plot of Fig.
4.5  is  a  function  of  data  values  raised  to  the  fourth  power.  Thus, 
reducing  the  tone  burst  amplitude  by  a  factor  of  two  reduces  the  error 
plot  by  a  factor  of  sixteen.  In  order  to  obtain  an  error  measure  which  is 
independent  of  signal  level,  the  total  normalized  error  is  computed  in 
accordance  with  Equation  3.26.  For  this  reconstruction  example, 
etotal,norm  =  1*0x10  ^ • 

Next,  a  short-time  spectral  modification  was  employed  in  which 
sixteen  time-domain  samples  in  each  channel  were  averaged,  and  each 
sample  was  replaced  with  the  average  value.  The  resulting  modified  data 
is  shown  in  Fig.  4.6.  This  modification,  which  is  employed  for 
demonstration purposes, can be described in terms of Fig. 3.1. Since
sixteen channels are used and the data rate of each channel has been
reduced by a factor of sixteen, the transmission channel data rate of
Fig. 3.1 is the same as the sampling rate of the original signal. Thus,
in general, exact reconstruction from this modified data is impossible.
The analysis transformation A uses a "boxcar" lowpass filter (i.e., a
filter  with  a  constant  amplitude,  finite  length  unit-sample  response) 
followed  by  downsampling  in  each  channel.  The  corresponding  synthesis 
transformation  S  uses  upsampling  followed  by  a  boxcar  lowpass  filter  in 
each  channel.  Thus,  the  data  of  Fig.  4.3  is  the  input  to  A,  and  the  data 
of Fig. 4.6 is the output from S.

Figure 4.5: Error (1000 Hz, Unmodified Data)


The  short-time  spectral  modification  used  to  obtain  Fig.  4.6  is  of 
the  type  commonly  employed  in  Automatic  Speech  Recognizer  front-ends 
(Section  5.4),  channel  vocoders  (Section  D.2),  and  power  spectrum 
estimation  techniques  (Section  D.5)  for  data  reduction  purposes,  although 
such  applications  typically  average  together  a  far  greater  number  of 
samples.  This  simple  data  modification  technique  will  be  used  to 
demonstrate  many  aspects  of  the  reconstruction  algorithm. 
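
For concreteness, the block-averaging modification applied to a single channel can be sketched as follows. This is an illustrative sketch, not the original implementation: the function name `block_average` is hypothetical, and leaving a trailing partial block unmodified is an assumption, since the handling of incomplete blocks is not specified here.

```python
import numpy as np

def block_average(channel, block=16):
    """Replace each run of `block` samples by its average value.

    Equivalent to a boxcar lowpass filter followed by downsampling, then
    upsampling followed by a boxcar hold. A trailing partial block is
    left unchanged (an assumption of this sketch).
    """
    out = np.array(channel, dtype=float)          # working copy
    n_blocks = len(out) // block
    body = out[:n_blocks * block].reshape(n_blocks, block)   # view into out
    body[:] = body.mean(axis=1, keepdims=True)
    return out

ramp = np.arange(32.0)
print(block_average(ramp))
```

Applied to a ramp of 32 samples, the first sixteen outputs equal the mean of samples 0 to 15 and the last sixteen equal the mean of samples 16 to 31, which produces the staircase discontinuities discussed for the modified data.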

The  reconstruction  algorithm  was  applied  to  the  data  of  Fig.  4.6, 
and  the  signal  of  Fig.  4.7  was  obtained.  The  reconstruction  is  roughly  a 
tone  burst  of  correct  amplitude,  frequency,  and  duration.  Since  the 
modified  data  of  Fig.  4.6  differs  most  from  the  unmodified  data  of  Fig. 
4.3  at  the  beginning  and  end  of  the  burst,  the  reconstruction  bears  least 
resemblance  to  the  original  signal  at  the  beginning  and  end  of  the  burst. 

In  order  to  verify  the  algorithm  operation,  the  reconstructed  signal 
of  Fig.  4.7  was  analyzed,  producing  the  3D  plot  of  Fig.  4.8.  Comparison 
of  Figs.  4.3,  4.6,  and  4.8  demonstrates  the  ability  of  the  algorithm  to 
reconstruct  a  real-valued  signal  having  short-time  spectral 
characteristics  which  match  the  given  data.  Since  the  plots  are  on  a 
logarithmic  scale,  low  level  differences  may  appear  exaggerated. 

Reconstruction  error  is  plotted  in  Fig.  4.9.  The  peak  error  value 
of  9.7x10^  and  the  total  normalized  error  value  of  6.8x10“^  are  many 
orders  of  magnitude  greater  than  values  for  the  previous  example.  Since 
significant  error  occurs  only  at  the  beginning  and  end  of  the  tone  burst, 
the  total  normalized  error  decreases  with  increasing  tone  burst  duration 
for this example.

Figure 4.9: Error (1000 Hz, Modified Data)

It will be shown, via several examples, that the reconstruction
algorithm  of  Section  3.4  performs  well  in  the  presence  of  short-time 
spectral  modifications.  Although  no  attempt  will  be  made  to  optimize  the 
algorithm  for  any  particular  modification,  it  is  possible  to  reduce 
reconstruction  error  by  doing  so.  For  example,  error  may  be  reduced  by 
choosing  an  error  weighting  function  which  extends  over  several  periods 
of the modification. This approach, however, significantly increases
computation  time  and  implementation  complexity,  and  will  not  be 
considered  here.  Note  that  the  largest  error  peak  of  Fig.  4.9  can  be 
reduced  by  simply  setting  the  reconstructed  sequence  values  to  zero  prior 
to  t=.0032  sec.  This  can  be  done  based  on  the  fact  that  STE  and  all  F/D 
channels  are  zero  prior  to  this  time.  The  reconstruction  algorithm 
produced  nonzero  values  in  Fig.  4.7  because  the  modified  data  changed 
abruptly  rather  than  in  a  bandlimited  fashion.  Since  the  model  of  the 
analysis  system  contained  in  the  reconstruction  process  (see  Fig.  3.2) 
produces  only  bandlimited  functions,  and  the  modified  data  does  not  agree 
with  the  model,  a  spike  occurs  in  the  error  whenever  a  discontinuity 
occurs  in  the  data.  This  effect  can  be  seen  by  comparing  Figs.  4.6,  4.8, 
and  4.9.  In  order  to  reduce  error  spike  amplitudes,  a  smoother 
short-time  spectral  modification  must  be  chosen.  For  example, 
considering  each  channel  separately,  when  the  value  at  each  discontinuity 
in  Fig.  4.6  is  replaced  by  an  average  of  the  surrounding  steady-state 
values, the maximum error value is reduced by nearly 20%. This error
reduction was accomplished without setting any values to zero prior to
t=.0032 sec. Since the resulting surface is somewhat smoother, the
reconstruction  algorithm  produces  a  signal  having  a  short-time  spectrum 


which  better  matches  the  smoothed  data.  However,  the  fact  that 
reconstruction  error  is  reduced  does  not  generally  indicate  that  a  signal 
reconstructed  from  smoothed  modified  data  will  bear  closer  resemblance  to 
the  original  signal.  Thus,  such  smoothing  will  not  be  employed  as  an  aid 
to  reconstruction  from  modified  data.  For  demonstration  purposes,  the 
algorithm  described  in  Section  3.4  will  be  applied  directly  in  all 
examples.


4.3  TONE  PAIR  BURSTS 


A 450 and 2500 Hz tone pair burst is shown in Fig. 4.10. From Fig.
2.12  it  can  be  seen  that  the  filters  having  center  frequencies  at  450  and 
2500  Hz  overlap  at  the  -40dB  level.  Thus,  since  the  signals  are  widely 
separated  in  frequency,  the  analysis  of  Fig.  4.11  does  not  reveal  any 
interaction  between  the  two  component  sinusoids.  Each  F/D  output  reaches 
a  steady-state  value  (see  Section  B.3.7).  A  short-time  spectral 
modification  was  applied  by  averaging  sixteen  samples  in  each  channel, 
and  each  sample  was  replaced  with  the  average  value  as  shown  in  Fig. 
4.12.  The  reconstruction  algorithm  was  applied  to  the  modified  data,  and 
the  result  is  shown  in  Fig.  4.13.  The  reconstructed  signal  was  then 
analyzed,  and  the  results  are  shown  in  Fig.  4.14.  Corresponding  error  is 
plotted  in  Fig.  4.15,  and  is  comparable  to  the  single  tone  burst  case 
shown in Fig. 4.9, although the peak error value of 1.2x10^6 is
considerably  less  due  to  a  reduction  in  average  input  signal  level.  The 
total  normalized  error  value  of  8.4x10"-*  is  comparable  to  that  of  the 
single  tone  burst  case  since  both  signals  are  of  the  same  duration.  As 
in the single tone burst data modification example, total normalized
error  is  a  function  of  tone  pair  burst  duration  for  this  example. 
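
The short-time spectral modification used in these examples can be sketched as follows. This is a minimal illustration only, assuming that the averaging is carried out over consecutive, non-overlapping blocks of sixteen samples within each channel; the function name `block_average` and the toy channel data are hypothetical and are not taken from the report.

```python
def block_average(channel, block_len=16):
    """Replace every sample in each consecutive block with the block mean.

    Assumption: blocks do not overlap; a final partial block is averaged
    over the samples it actually contains.
    """
    out = []
    for start in range(0, len(channel), block_len):
        block = channel[start:start + block_len]
        mean = sum(block) / len(block)
        out.extend([mean] * len(block))
    return out


# Toy F/D output: a ramp, so the staircase produced by averaging is obvious.
channel = [float(n) for n in range(40)]
modified = block_average(channel)
```

Under this assumption the modified channel is piecewise constant, so steps appear at the block boundaries; these are the kind of discontinuities that the reconstruction algorithm has difficulty matching.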

A  1000  and  1600  Hz  tone  pair  burst  is  shown  in  Fig.  4.16.  Since  the 
filters  at  the  corresponding  center  frequencies  overlap  at  the  -18dB 
level,  some  interaction  between  the  spectral  components  is  visible  in  the 
analysis  of  Fig.  4.17.  Each  F/D  output  consists  of  a  constant  and  a  beat 
frequency  component  (see  Section  B.3.8).  When  the  short-time  spectral 
modification  is  applied,  the  beat  frequency  component  is  eliminated  as 
shown in Fig. 4.18.

Figure 4.12: 3D Plot of Modified Data

Thus, the surface of Fig. 4.18 represents inconsistent information. On
one hand, the data indicates presence of two
sine  waves  because  two  spectral  peaks  are  visible.  On  the  other  hand,  if 
two sine waves are present then beat frequencies should occur, but none
are visible in the data of Fig. 4.18. Thus, given the inconsistent data
of Fig. 4.18, a reasonable signal reconstruction approach might be to
first  choose  two  sinusoidal  components  as  indicated  by  the  spectral 
peaks.  A  low-level  periodic  waveform  having  amplitude  and  frequency 
determined  in  accordance  with  an  error  criterion  could  then  be  added  to 
the  two  sinusoids,  thereby  reducing  the  beat  frequencies  in  order  to 
approximate  the  data  of  Fig.  4.18.  This  was  exactly  the  result  obtained 
upon  application  of  the  reconstruction  algorithm  to  the  data  of  Fig. 
4.18,  as  shown  in  the  reconstruction  of  Fig.  4.19.  Analysis  of  the 
reconstructed  signal  is  shown  in  Fig.  4.20,  and  it  can  be  seen  that  the 
reconstruction  algorithm  inserts  a  third  sinusoidal  component  to 
compensate for the inconsistent data of Fig. 4.18. The plot of Fig. 4.21
reveals a 530 Hz oscillation in the error. Since there are few
discontinuities in the data of Fig. 4.18, there are few spikes in the
error  plot  of  Fig.  4.21.  Since  the  area  under  this  error  plot  during  the 
steady-state  portion  of  the  tone  pair  burst  is  greater  than  the  area 
under  the  error  transients  at  the  beginning  and  end  of  the  burst,  the 
total  normalized  error  does  not  depend  strongly  on  tone  pair  burst 
duration.  The  total  normalized  error  value  of  3.1x10”^  is  nearly  four 
times  that  of  the  previous  tone  pair  burst  example,  and  the  peak  error 
value of 2.4x10^6 is twice that of the previous example.
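
The constant and beat frequency components of the F/D outputs in this example can be motivated by a short calculation. This is a sketch only, assuming a square law detector and a channel whose bandpass filter passes the two tones with complex gains $g_1$ and $g_2$ (hypothetical symbols, with $\phi = \arg g_2 - \arg g_1$); the report's own expressions are those of Section B.3.8.

```latex
\left| g_1 A_1 e^{j\Omega_1 t} + g_2 A_2 e^{j\Omega_2 t} \right|^2
  = |g_1|^2 A_1^2 + |g_2|^2 A_2^2
  + 2\,|g_1 g_2|\,A_1 A_2 \cos\!\big((\Omega_2 - \Omega_1)\,t + \phi\big)
```

The first two terms are constant and the last oscillates at the difference frequency, 600 Hz for the 1000 and 1600 Hz tone pair. If the F/D outputs are sampled at 10,000 samples per second, sixteen samples span 1.6 ms, which is close to the 1.67 ms period of a 600 Hz beat, so averaging over sixteen samples removes most of the oscillating term.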
Figure 4.20: 3D Plot of Analyzed Reconstruction


Fig. 4.24. A poor quality reconstruction is obtained, as shown in Fig.
4.25.  The  analyzed  reconstruction  is  shown  in  Fig.  4.26,  and  the 
corresponding  error  in  Fig.  4.27.  The  peak  error  value  of  4.6x10^  and 
total  normalized  error  value  of  5.6x10  c  are  the  largest  of  any  examples 
thus  far.  Again,  since  the  area  under  the  error  plot  during  the 
steady-state  portion  of  the  tone  pair  burst  is  greater  than  the  area 


Figure  4.24:  3D  Plot  of  Modified  Data 


4.4 SYNTHETIC VOWELS


Synthetic  vowels  provide  a  controlled  speech-like  signal  for  testing 
and  demonstration  purposes.  Such  vowels  can  be  conveniently  generated 
via  an  acoustic  tube  vocal  tract  model  (Rabiner  and  Schafer  [3]).  An 
example, the synthetic vowel /E/ as in "bet," is shown in Fig. 4.28.
Vowel sounds are often characterized in terms of their spectral peaks, or
formants (Peterson and Barney [42]). This vowel has a first formant
frequency F1 of 530 Hz. The second and third formants are F2=1840 and
F3=2480 Hz. Formant bandwidths are 40 Hz for F1, 60 Hz for F2, and 100
Hz for F3. The pitch frequency is F0=125 Hz, so a male speaker is
simulated. The pitch is visible in the time dimension and formant peaks
are visible in the frequency dimension of Fig. 4.29. Note that F1 has a
far higher level than F2 or F3, and thus is the most important feature
for reconstruction spectral matching purposes.

As  in  previous  examples,  the  data  was  modified  by  averaging  sixteen 
time-domain  samples  in  each  channel,  and  replacing  each  sample  with  the 
average  value  as  shown  in  Fig.  4.30.  The  reconstruction  algorithm  was 
applied  to  the  modified  data,  and  the  results  are  shown  in  Fig.  4.31. 
The  reconstructed  signal  was  then  analyzed,  and  the  result  is  shown  in 
Fig. 4.32. A comparison of Figs. 4.32 and 4.30 reveals that F1 of the
analyzed  reconstructed  signal  provides  a  good  match  to  the  modified 
spectrum.  This  observation  is  supported  by  the  corresponding  error  shown 
in Fig. 4.33. The peak error value of 2.0x10^6 is comparable to that of
the  tone  pair  burst  examples  since  similar  average  signal  levels  are 
used. The total normalized error value of 4.3x10^-2 is also comparable to
previous  examples,  and  does  not  depend  strongly  on  signal  duration. 


A similar synthetic vowel, /AE/ as in "bat," is shown in Fig. 4.34.
This vowel has formant frequencies F1=660, F2=1720, and F3=2410 Hz.
Formant  bandwidths  and  pitch  are  the  same  as  for  the  previous  example. 
The  analyzed  signal  is  shown  in  Fig.  4.35.  The  data  was  modified  as  shown 
in  Fig.  4.36.  In  this  example,  the  short-time  spectral  modification 
produces large discontinuities in F1 as compared to the previous example.
The  reconstruction  algorithm  was  applied  to  the  modified  data,  and 
results  are  shown  in  Fig.  4.37.  The  reconstructed  signal  was  analyzed  as 
shown  in  Fig.  4.38.  A  comparison  of  Figs.  4.38  and  4.36  reveals  a 
relatively poor match between F1 of the analyzed reconstructed signal and
the  modified  spectrum,  due  to  the  reconstruction  algorithm's  inability  to 
model  discontinuities.  The  discontinuities  also  cause  large  spikes  in  the 
corresponding  error  of  Fig.  4.39.  Although  the  peak  error  value  of 
2.0x10^6 is the same as the previous example, the total normalized error
value of 8.4x10^-2 is twice that of the previous example.
Figure 4.36: 3D Plot of Modified Data


4.5 NATURAL SPEECH SIGNALS


Fig. 4.40 presents a time-domain plot of the sentence "Their hot
protein can pace on our breakdowns" as spoken by a male subject. This
signal  has  been  pre-filtered  to  suppress  components  outside  the  200-3675 
Hz  frequency  range.  The  signal  of  Fig.  4.40  was  analyzed,  and  Fig.  4.41 
is  a  plot  of  the  resulting  F/D  outputs.  In  order  to  reduce  the  figure 
size,  one  of  every  eight  F/D  output  samples  (in  the  time  domain)  was  used 
to  create  the  3D  plot  of  Fig.  4.42.  Many  features  of  the  speech  signal, 
such  as  vowel  structures,  can  be  seen  in  the  analysis  of  Fig.  4.42. 
Interpretation  of  this  type  of  speech  display  is  discussed  by  Searle 
[43],  [44].  The  reconstruction  algorithm  was  applied  to  the 
non-downsampled  data  of  Fig.  4.41,  and  the  result  is  shown  in  Fig.  4.43. 
Except  for  an  overall  sign  factor,  the  reconstruction  of  Fig.  4.43  is 
indistinguishable  from  the  original  signal  of  Fig.  4.40. 
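
The overall sign factor arises because magnitude data are identical for a signal and its negative. The following sketch illustrates this with a plain uniform short-time DFT rather than the nonuniform GSTFT of Chapter 2; the function `stft_magnitude`, the toy signal, and the window choice are assumptions made only for illustration.

```python
import cmath
import math


def stft_magnitude(x, window, hop):
    """Magnitude of a simple short-time DFT (one DFT per window position)."""
    n_win = len(window)
    frames = []
    for start in range(0, len(x) - n_win + 1, hop):
        seg = [x[start + n] * window[n] for n in range(n_win)]
        frame = [
            abs(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_win)
                    for n in range(n_win)))
            for k in range(n_win)
        ]
        frames.append(frame)
    return frames


# Toy signal: two sinusoids; the window is a Hann shape (illustrative choice).
x = [math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n + 0.4) for n in range(64)]
window = [0.5 - 0.5 * math.cos(2 * math.pi * n / 16) for n in range(16)]

mag_pos = stft_magnitude(x, window, hop=4)
mag_neg = stft_magnitude([-v for v in x], window, hop=4)

# Largest difference between the two magnitude surfaces.
max_diff = max(abs(a - b)
               for fa, fb in zip(mag_pos, mag_neg)
               for a, b in zip(fa, fb))
```

Since the two magnitude surfaces coincide, no algorithm working from magnitude alone can determine the polarity of the original waveform.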

For  demonstration  purposes,  a  short  phrase  “their  hot,”  shown  in 
Fig.  4.44,  was  obtained  from  the  sentence  of  Fig.  4.40.  The  short  phrase 
was analyzed, and a portion of the results is shown in the 3D plot of
Fig. 4.45. The reconstruction algorithm was applied to the unmodified
F/D outputs, and the signal of Fig. 4.46 was obtained. The reconstructed
signal  of  Fig.  4.46  is  indistinguishable  from  the  original  signal  of  Fig. 
4.44,  and  has  the  same  overall  polarity. 
Figure 4.45: 3D Plot of Unmodified Data


The  F/D  outputs  were  then  modified  by  averaging  sixteen  time-domain 
samples  together  in  each  channel,  and  replacing  the  samples  with  average 
values.  A  portion  of  the  modified  data  is  shown  in  the  3D  plot  of  Fig. 
4.47,  and  the  resulting  reconstruction  is  shown  in  Fig.  4.48.  Note  that 
the  correct  pitch  has  been  retained,  and  the  reconstructed  signal  appears 
somewhat  noisy.  The  analyzed  reconstruction  is  shown  in  Fig.  4.49,  and 
the  corresponding  error  in  Fig.  4.50.  Since  the  average  signal  level  is 
less  than  in  previous  examples,  the  peak  error  value  of  7.4x10^  is  also 
less.  The  total  normalized  error  value  of  7.4xl0-^,  however,  is 
comparable  to  that  of  previous  examples.  The  total  normalized  error  does 
not depend strongly on signal duration for the examples given in this
section.


Next,  a  short-time  spectral  modification  was  considered  in  which  the 
data  was  not  so  severely  distorted.  This  modification  averaged 
time-domain  samples  together  in  each  channel  and  replaced  the  samples 
with  average  values,  but  fewer  samples  were  averaged  in  the  high 
frequency  channels.  To  ensure  that  all  channels  were  modified  to  some 
extent,  two  samples  were  averaged  together  in  each  of  the  high  frequency 
channels.  More  samples  were  averaged  in  lower  frequency  channels 
according  to  their  bandwidth.  Specifically,  7  samples  were  averaged  in 
the energy channel, 6 samples in Channels #1-2, 5 in #3-4, 4 in #5-7, 3
in #8-10, and 2 in #11-15. Note that this modification is different from
the  modification  used  in  all  previous  examples,  where  the  same  number  of 
samples  was  averaged  regardless  of  filter  bandwidth.  This  short-time 
spectral  modification  corresponds  more  closely  to  a  simple  decimation  and 
interpolation  of  the  F/D  and  STE  outputs,  as  discussed  in  Section  2.6.  A 
portion  of  the  slightly  modified  data  is  shown  in  Fig.  4.51,  and  the 
resulting  reconstruction  is  shown  in  Fig.  4.52.  The  reconstructed  signal 
is  similar  to  the  original,  although  differences  are  clearly  visible. 
The  reconstructed  speech  sounds  quite  similar  to  the  original  signal,  but 
the  two  signals  are  audibly  distinguishable.  The  analyzed  reconstruction 
is shown in Fig. 4.53, and the corresponding error in Fig. 4.54. The peak
error  value  of  1.7x10^  and  the  total  normalized  error  value  of  9.3xlO-^ 
are  far  less  than  values  obtained  for  the  previous  example. 



Figure  4.53:  3D  Plot  of  Analyzed  Reconstruction  (Slightly  Modified  Data) 


Figure  4.54:  Error  (Slightly  Modified  Data) 


Finally,  a  short-time  spectral  modification  was  considered  in  which 
the  data  was  highly  distorted.  The  F/D  outputs  were  modified  by 
averaging  many  samples  together  in  each  channel,  and  replacing  the 
samples  with  average  values.  Specifically,  74  samples  were  averaged  in 
the energy channel, 50 samples in Channels #1-2, 45 in #3, 41 in #4, 35
in #5, 33 in #6, 31 in #7, 26 in #8, 23 in #9, 20 in #10, 17 in #11, 15
in #12, 13 in #13, 11 in #14, and 9 in #15. The number of samples
averaged in each channel corresponds to the minimum sampling rate for the
channel based on 3dB bandwidths, i.e., 5000 divided by the critical
bandwidth  (see  Section  2.6).  Since  values  are  averaged,  however,  the 
resulting  data  is  highly  modified  and  unsuitable  for  signal  recovery 
purposes.  The  resulting  transmission  channel  data  rate  of  7276  samples 
per  second  is  less  than  the  original  signal  sampling  rate  of  10,000 
samples  per  second.  Therefore,  this  spectral  distortion  is  more  severe 
than  any  considered  in  previous  examples.  A  portion  of  the  highly 
modified  data  is  shown  in  Fig.  4.55,  and  the  resulting  reconstruction  is 
shown  in  Fig.  4.56.  The  averaging  process  destroys  periodic  pitch 
information,  and  the  reconstructed  signal  appears  quite  noisy.  The 
overall  envelope  of  the  reconstructed  waveform,  however,  is  similar  to 
the  original  waveform  envelope.  The  waveform  reconstructed  from  highly 
modified  data  sounds  like  very  noisy  speech.  Analysis  of  the 
reconstructed  signal  is  shown  in  Fig.  4.57,  and  the  corresponding  error 
in  Fig.  4.58.  Although  the  peak  error  value  of  6.9x10^  is  comparable 
with  the  value  obtained  in  the  modified  data  example  of  Fig.  4.50,  the 
total  normalized  error  value  of  1.2xl0-^  is  greater  since  the  area  under 
the  plot  of  Fig.  4.58  is  greater  than  the  area  under  the  plot  of  Fig. 
Figure 4.55: 3D Plot of Highly Modified Data


Figure 4.57: 3D Plot of Analyzed Reconstruction (Highly Modified Data)
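

The transmission channel data rate quoted for the highly modified example can be checked with a few lines of arithmetic. The sketch below assumes that one value is retained per averaged block in each channel, so that a channel averaging N samples contributes 10,000/N samples per second; the names `channel_data_rate`, `highly_modified`, and `slightly_modified` are hypothetical.

```python
SAMPLE_RATE = 10_000  # original signal sampling rate, samples per second

# Samples averaged per channel: energy channel first, then Channels #1-15.
highly_modified = [74, 50, 50, 45, 41, 35, 33, 31, 26, 23, 20, 17, 15, 13, 11, 9]
slightly_modified = [7] + [6] * 2 + [5] * 2 + [4] * 3 + [3] * 3 + [2] * 5


def channel_data_rate(block_lengths, sample_rate=SAMPLE_RATE):
    """Total samples per second if one value is kept per averaged block."""
    return sum(sample_rate / n for n in block_lengths)


rate_high = channel_data_rate(highly_modified)      # about 7276 samples/sec
rate_slight = channel_data_rate(slightly_modified)  # well above 10,000
```

Under this assumption the highly modified case reproduces the 7276 samples per second stated in Section 4.5, while the slightly modified case corresponds to a rate several times higher than the original sampling rate.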


CHAPTER  5 


SUMMARY  AND  SUGGESTIONS  FOR  FURTHER  RESEARCH 


5.1  SUMMARY 

This  report  has  presented  a  speech  analysis/synthesis  system  based 
on perception. A nonuniform Filter/Detector (F/D) bank and optional
Short-Time  Energy  constraint  formed  the  analysis  system.  F/D  bank 
characteristics  were  determined  from  a  combination  of  physiological  and 
psychoacoustic  results.  A  new  relationship  demonstrated  that  the  F/D 
bank  could  be  implemented  by  the  Generalized  Short-Time  Fourier  Transform 
(GSTFT)  magnitude,  and  a  digital  implementation  suitable  for  real-time 
analysis was given. For speech synthesis, a new approach capable of
reconstructing signals from the GSTFT magnitude was used. The speech
analysis/synthesis  system  achieved  exact  reconstruction  in  the  absence  of 
data  modification.  The  ability  of  the  synthesis  system  to  reconstruct 
speech  from  modified  data  was  also  demonstrated. 


5.2  REAL-TIME  SYNTHESIS 

Although  the  analysis  system  described  in  Chapter  2  and  Appendix  C 
is  suitable  for  real-time  operation  using  existing  technology,  the 
synthesis  system  of  Chapter  3  generally  is  not.  Further  improvements  in 
the  synthesis  algorithm,  however,  may  produce  a  real-time 
analysis/synthesis  system  based  on  perception.  For  example,  the  triangle 
and Schwarz inequalities (Churchill, Brown, and Verhey [45]) can be
applied  to  the  recursive  GSTFT  of  Equation  2.29,  resulting  in  an 
expression  which  directly  relates  reconstructed  sequence  values  with  the 
GSTFT  magnitude.  It  may  be  possible  to  perform  crude  real-time  synthesis 
from  such  results.  Alternatively,  it  may  be  possible  to  use  a  synthesis 
approach  similar  to  that  employed  by  channel  vocoders  (Section  D.2). 


5.3  DATA  REDUCTION 


Although data reduction is not an essential part of an auditory
system model, it may be useful in many applications. When little data
reduction is required, the standard downsampling/upsampling approach of
Section 2.6 is applicable. When a high degree of data reduction is
required,  more  sophisticated  approaches  may  be  used.  For  example,  it  is 
clear  from  Figs.  4.41  and  4.42  that  an  efficient  encoding  can  be 
accomplished  by  matching  the  STE  and  F/D  output  time-domain  waveforms 
with  a  few  well-chosen  prototype  wave  shapes.  Such  an  encoding  can  be 
performed automatically by a principal components approach (Chu [16]).
Effectively,  the  principal  components  analysis  applied  to  the  temporal 
domain  performs  a  type  of  pitch  extraction.  A  principal  components 
synthesis,  followed  by  signal  reconstruction  from  the  resulting  modified 
data,  produces  a  signal  which  sounds  quite  similar  to  channel  vocoded 
speech  (see  Section  D.2).  Speech  can  be  obtained  via  this  approach  using 
transmission  channel  data  rates  on  the  order  of  10,000  bits  per  second 
(not  samples  per  second).  Further  research  in  this  area  may  prove 
beneficial  to  the  design  of  channel  vocoders  based  on  properties  of  the 
human  auditory  system  (Gold  and  Tierney  [46]). 
5.4 AUTOMATIC SPEECH RECOGNITION MACHINE DESIGN


When  an  Automatic  Speech  Recognition  (ASR)  machine  fails  to 
correctly  identify  a  spoken  input  word,  the  failure  may  be  due  to 
inadequacies  in  the  first  processing  stage,  or  "front-end."  Note  that 
front-end  inadequacies  can  cause  unavoidable  errors  in  subsequent  stages. 
Since  the  new  algorithm  described  in  Chapter  3  is  the  only  known  means  of 
reconstructing  speech  from  critical  bandwidth  F/D  outputs,  it  provides  a 
new  tool  for  ASR  machine  front-end  design.  Front-end  inadequacies  can 
now  be  discovered  when  a  synthesis  technique  is  used  to  test  the  analyzed 
speech  data  for  suitable  information  content. 

The  need  for  a  synthesis  system  in  ASR  machine  front-end  testing  can 
be  illustrated  by  a  few  simple  examples.  A  bank  of  bandpass  filters 
having  constant  passband  gain  and  minimum  passband  overlap  is  often  used 
in ASR front-ends (Schafer, Rabiner, and Herrmann [47]; Rubinstein and
Silverman [48]; Dautrich, Rabiner, and Martin [49], [50]). If the
filters are carefully designed, it is possible to reconstruct the input
signal  by  simply  adding  the  non-detected  filter  outputs  together.  When 
the  filters  are  followed  by  detectors,  however,  practical  reconstruction 
of  signals  from  the  resulting  F/D  bank  outputs  is  impossible.  For 
instance,  tones  of  widely  different  frequencies  produce  identical 
steady-state F/D outputs (see Section 3.2.2), and reconstruction of such
signals  is  impossible.  Since  humans  have  excellent  frequency  resolution 
ability,  it  is  clear  that  this  type  of  F/D  bank  cannot  be  used  to  perform 
many waveform discrimination tasks easily performed by humans. For
examples relevant to the task of speech recognition, consider the
synthetic  vowels  /E/  and  /AE/  depicted  in  Figs.  5.1,  5.2,  and  5.3.  In 
order  to  achieve  control  over  each  individual  spectral  component,  these 
vowels  were  created  by  adding  sine  waves  rather  than  using  an  acoustic 
tube  vocal  tract  model  as  in  Section  4.4.  Assume  that  a  critical 
bandwidth  F/D  bank  having  constant  passband  gain  and  minimum  passband 
overlap  is  designed  by  interpolating  the  data  of  Table  2.1.  When  the  F/D 
bank  is  used  to  analyze  the  synthetic  vowels  of  Fig.  5.1,  it  follows  from 
Sections  3.2.2,  B.3.7,  and  B.3.8  that  the  two  vowels  yield  identical 
steady-state  outputs.  Thus,  it  is  impossible  for  an  ASR  machine  equipped 
with  such  a  front-end  to  distinguish  between  these  steady-state  sounds.  A 
similar  result  holds  for  several  other  vowel  pairs  including  the  /OW/ 
sound  in  "bought"  and  the  /U/  sound  in  "foot,"  as  well  as  the  /UH/  sound 
in  "but"  and  the  /ER/  sound  in  "bird."  The  importance  of  this  effect  with 
regard  to  specific  speech  recognition  vocabularies  is  a  topic  for  further 
research,  and  a  speech  synthesis  system  similar  to  that  described  in 
Chapter  3  can  be  applied  to  test  the  results.  Of  course,  such  problems 
are  avoided  altogether  when  the  speech  analysis  system  of  Chapter  2  is 
used. 
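
The loss of frequency information in a minimum-overlap F/D bank can be illustrated with an idealized model. This is a sketch under stated assumptions: the band edges below are hypothetical (they are not the values of Table 2.1), the bandpass filters are ideal with unity passband gain and no overlap, and the detector is a square law device whose smoothed steady-state output for a tone of amplitude A is A²/2.

```python
import bisect

# Hypothetical band edges in Hz (NOT the values of Table 2.1).
BAND_EDGES = [200, 400, 630, 920, 1270, 1720, 2320, 3150, 3700]


def steady_state_outputs(tones, band_edges=BAND_EDGES):
    """Steady-state square-law F/D outputs for ideal, non-overlapping bands.

    tones: list of (frequency_hz, amplitude) pairs.
    Assumptions: unity passband gain, zero stopband gain, and smoothing
    that leaves only the average power A**2 / 2 of each tone.
    """
    outputs = [0.0] * (len(band_edges) - 1)
    for freq, amp in tones:
        if band_edges[0] <= freq < band_edges[-1]:
            channel = bisect.bisect_right(band_edges, freq) - 1
            outputs[channel] += amp ** 2 / 2
    return outputs


# Two tones of quite different frequency that fall in the same band.
out_a = steady_state_outputs([(2400, 1.0)])
out_b = steady_state_outputs([(3100, 1.0)])
# A tone in a different band, for contrast.
out_c = steady_state_outputs([(1000, 1.0)])
```

In this idealized model `out_a` and `out_b` are the same list, so any later processing stage receives no information that would separate the two tones. Overlapping filter skirts, as in the analysis system of Chapter 2, remove this ambiguity because a tone's frequency then affects the relative levels in neighboring channels.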


In  addition  to  testing  front-ends,  the  synthesis  approach  can  be 
used  to  test  effects  of  subsequent  processing  stages.  Such  tests  can 
reveal  loss  of  information  relevant  to  recognition  of  a  given  vocabulary. 
Note  that  loss  of  irrelevant  information  may  be  useful  for  data  reduction 
purposes.

Vowels Yielding Identical F/D Outputs

Such loss is acceptable so long as the nature of the loss is
understood and the results can be tested. Testing is accomplished via
synthesis  from  the  modified  data,  as  demonstrated  in  Chapter  4.  For 
example,  to  achieve  data  reduction  an  additional  narrow  lowpass  filter  is 
often  placed  at  each  F/D  output  or,  equivalently,  a  narrow  lowpass 
smoothing  filter  is  used  in  the  detector.  The  short-time  spectral 
modification  examples  of  Sections  4.4  and  4.5  indicate  that  a  great  deal 
of  information  is  lost  when  speech  is  processed  by  such  a  system.  The 
information  loss,  however,  may  or  may  not  be  important  for  a  specific 
speech  recognition  vocabulary.  Again,  this  is  a  topic  for  further 
research.  Note  that  approaches  to  data  reduction  other  than  narrow 
lowpass  filtering  can  be  used  which  do  not  sacrifice  intelligibility  of 
the  reconstructed  speech  (Section  5.3).  Such  approaches  are  therefore 
suitable  for  a  wider  variety  of  vocabularies. 


The  preceding  observations  are  consistent  with  experimental  results 
reported in the literature. For example, a recent study (Dautrich et al.
[49],  [50])  has  shown  that  a  word  recognizer  based  on  Linear  Predictive 
Coding  (LPC)  techniques  performed  better  than  a  particular  13-channel 
critical  band  F/D  bank  design.  In  this  study  the  lowpass  smoothing 
filter  cutoff  frequencies  were  chosen  so  that  each  F/D  output  could  be 
sampled  at  a  67  Hz  rate  regardless  of  the  bandpass  filter  bandwidth,  and 
the  digital  bandpass  filters  had  constant  passband  gain  and  minimum 
passband  overlap.  In  an  earlier  study  (White  and  Neely  [51]),  LPC  was 
compared with a 20-channel overlapped F/D bank (1/3 octave analog filters
were used to cover the 100-10,000 Hz range) using a 100 Hz sampling rate
on each channel, and similar scores were produced by both the F/D and LPC
approaches. Finally, a study using mel-frequency cepstrum coefficients,
which  are  similar  to  processed  critical  band  F/D  bank  outputs,  achieved 
superior  performance  compared  to  LPC  (Davis  and  Mermelstein  [52]). 

The  comparison  of  speech  recognizers  using  different  front-ends  is  a 
difficult  task.  On  one  hand,  if  a  high  quality  speech  signal  can  be 
reconstructed  from  F/D  bank  front-end  outputs,  then  any  speech  recognizer 
errors  must  be  attributed  to  the  recognition  algorithms  rather  than 
front-end  inadequacies.  Since  a  high  quality  signal  cannot  generally  be 
reconstructed  from  data  produced  by  LPC  front-ends  (the  signal  may  not 
fit  the  model  assumed  by  LPC  analysis/synthesis),  the  F/D  bank  approach 
can  potentially  outperform  the  LPC  approach.  On  the  other  hand,  the  LPC 
approach  may  be  more  convenient  since  it  achieves  a  high  degree  of  data 
reduction.  Therefore,  an  important  topic  for  future  research  is  a 
comparison  of  speech  recognizers  using  LPC  with  those  using  F/D  bank 
front-ends followed by data reduction approaches which do not sacrifice
speech  intelligibility. 
APPENDIX A


DEFINITIONS 


This  appendix  presents  standard  definitions  for  reference  purposes 
(for further information, see Oppenheim and Willsky [53]). In the
continuous-time case, the time variable is "t" and the frequency variable
is "$\Omega$." In the discrete-time case, the time variable is "n" and the
frequency variable is "$\omega$."


The continuous-time Fourier transform of a signal x(t) is defined
as:

\[ \mathrm{FT}\{x(t)\} = X(j\Omega) = \int_{-\infty}^{+\infty} x(t)\, e^{-j\Omega t}\, dt. \tag{A.1} \]


The continuous-time inverse Fourier transform is:

\[ x(t) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} X(j\Omega)\, e^{j\Omega t}\, d\Omega. \tag{A.2} \]


The modulation property of continuous-time Fourier transforms is given
by:

\[ \mathrm{FT}\{x(t)y(t)\} = \frac{1}{2\pi}\left[ X(j\Omega) * Y(j\Omega) \right]
   = \frac{1}{2\pi} \int_{-\infty}^{+\infty} X(j\Lambda)\, Y(j\Omega - j\Lambda)\, d\Lambda. \tag{A.3} \]


The discrete-time Fourier transform of a signal x(n) is defined as:

\[ \mathrm{FT}\{x(n)\} = X(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x(n)\, e^{-j\omega n}. \tag{A.4} \]


The discrete-time inverse Fourier transform is:

\[ x(n) = \frac{1}{2\pi} \int_{-\pi}^{+\pi} X(e^{j\omega})\, e^{j\omega n}\, d\omega. \tag{A.5} \]
The modulation property of discrete-time Fourier transforms is given by:

\[ \mathrm{FT}\{x(n)y(n)\} = \frac{1}{2\pi}\left[ X(e^{j\omega}) * Y(e^{j\omega}) \right]
   = \frac{1}{2\pi} \int_{-\pi}^{+\pi} X(e^{j\lambda})\, Y(e^{j\omega - j\lambda})\, d\lambda. \tag{A.6} \]


The z-transform of a discrete-time signal x(n) is defined as:

\[ X(z) = \sum_{n=-\infty}^{\infty} x(n)\, z^{-n}, \tag{A.7} \]


where  z  is  a  complex  variable. 


The Laplace transform of a continuous-time signal x(t), specified
for t>0, is:

\[ \mathrm{LT}\{x(t)\} = \int_{0}^{\infty} x(t)\, e^{-st}\, dt, \tag{A.8} \]


where  s  is  a  complex  variable. 
B.2 CONTINUOUS-TIME FILTER/DETECTOR COMPONENT DESCRIPTION


A  F/D  subsystem  consists  of  a  bandpass  filter  followed  by  a 
detector,  as  shown  in  Fig.  B.l.  The  detector  is  comprised  of  a 
memoryless nonlinearity and a lowpass smoothing filter.


B.2.1 BANDPASS FILTER DESIGN

A  simple  design  procedure  for  Linear  Time-Invariant  (LTI)  bandpass 
filters  involves  modulating  the  impulse  response  of  a  prototype  lowpass 
filter.  Let  the  prototype  lowpass  filter  impulse  response  be  denoted  by 
h(t).  The  function  h(t)  is  also  known  as  a  window  function  because  it 
sometimes  serves  as  a  time  domain  "window"  through  which  signals  are 
viewed.  As  a  specific  example  of  the  design  procedure,  let  h(t)  be  the 
impulse  response  of  an  ideal  LTI  lowpass  filter, 

\[ h(t) = \frac{\sin(\Omega_h t)}{\pi t}. \tag{B.1} \]

The window function's Fourier transform FT{h(t)} is shown in Fig. B.2a.
In the frequency domain, the window function has bandwidth $\Omega_h$ and
unity gain. From the modulation property of Fourier transforms (see
Appendix A), the function $h(t)\sin(\Omega_c t)$ is the impulse response of
a bandpass filter. Frequency domain magnitude characteristics of the
bandpass filter are shown in Fig. B.2b. When $0<\Omega_h<\Omega_c$ the
bandpass filter designed in this manner has center frequency $\Omega_c$,
bandwidth $2\Omega_h$, and a gain of one-half.
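
The modulation design procedure can be tried numerically. The following is a discrete-time sketch of the continuous-time procedure described above, not the report's implementation: it assumes a Hamming-windowed sinc as the prototype lowpass filter, and the filter length, cutoff, and center frequency are arbitrary illustrative values.

```python
import cmath
import math

N_TAPS = 201                 # filter length (illustrative)
W_H = 0.1 * math.pi          # prototype lowpass cutoff, rad/sample
W_C = 0.4 * math.pi          # bandpass center frequency, rad/sample
MID = (N_TAPS - 1) // 2


def lowpass_prototype():
    """Hamming-windowed sinc, normalized to unity gain at zero frequency."""
    h = []
    for n in range(N_TAPS):
        m = n - MID
        ideal = W_H / math.pi if m == 0 else math.sin(W_H * m) / (math.pi * m)
        hamming = 0.54 - 0.46 * math.cos(2 * math.pi * n / (N_TAPS - 1))
        h.append(ideal * hamming)
    dc_gain = sum(h)
    return [v / dc_gain for v in h]


def gain(taps, w):
    """Magnitude of the frequency response at w rad/sample."""
    return abs(sum(t * cmath.exp(-1j * w * n) for n, t in enumerate(taps)))


h = lowpass_prototype()
# Modulate the prototype to obtain the bandpass impulse response.
bandpass = [h[n] * math.sin(W_C * (n - MID)) for n in range(N_TAPS)]

center_gain = gain(bandpass, W_C)                    # close to one-half
stopband_gain = gain(bandpass, W_C + 0.3 * math.pi)  # well outside the band
```

The gain of one-half at the center frequency appears because the sine splits the prototype's unity-gain response into two shifted copies, one at the positive and one at the negative center frequency.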


Figure B.1: General Filter/Detector Subsystem


Figure B.2: Bandpass Filter Design Example


Although  many  bandpass  filter  design  procedures  exist,  only  the 
approach  which  modulates  a  prototype  lowpass  filter  will  be  discussed. 
It  is  shown  in  Section  2.4  that  this  particular  design  technique  is  used 
in  Short-Time  Fourier  Transform  analysis.  The  technique  is  also  useful 
in  auditory  system  modeling,  as  shown  in  Section  2.2. 

In practical applications a window function other than the impulse
response of an ideal lowpass filter is used. When h(t) is the impulse
response of a non-ideal lowpass filter, $\Omega_h$ is chosen such that
frequency components in the region $|\Omega|>\Omega_h$ are negligible. In
Sections 2.4.2.3 and D.4, this bandwidth is referred to as the one-sided
main lobe bandwidth.


B.2.2  MEMORYLESS  NONLINEARITIES 

A device is memoryless if its output at any given time depends only
upon the input at that time. For example, let the input to a device be
y(t) and the output be w(t). The device is memoryless if w(t) at some
time $t_0$ depends only upon $y(t_0)$.

Let the waveform $\alpha(t)$ be the output of a device in response to any
input waveform a(t), and $\beta(t)$ be the response to b(t). The device is
nonlinear in the system theory sense if the input $c_1 a(t)+c_2 b(t)$ does
not yield an output $c_1\alpha(t)+c_2\beta(t)$, where $c_1$ and $c_2$ are
constants.

An example of a device which is both memoryless and nonlinear is the
square law device described by the input-output relationship:

\[ w_1(t) = y^2(t). \tag{B.2} \]

Another memoryless nonlinearity is the full wave piecewise linear device
described by the input-output relationship:

\[ w_2(t) = |y(t)|. \tag{B.3} \]

The  half  wave  piecewise  linear  device  is  a  memoryless  nonlinearity 
described  by  the  input-output  relationship: 

\[ w_3(t) = [y(t)/2] + [|y(t)|/2]. \tag{B.4} \]

The  half  wave  piecewise  linear  device  can  be  followed  by  a  square  law 
device  to  implement  a  half  wave  square  law  device  with  input-output 
relationship: 

\[ w_4(t) = [y^2(t) + y(t)|y(t)|]/2. \tag{B.5} \]

In  addition  to  those  described  above,  other  devices  such  as  exponential 
and  square  root  are  often  useful. 

The  F/D  of  Fig.  B.l  will  accomplish  demodulation  so  long  as  the 
memoryless  nonlinearity  does  not  possess  an  input-output  relationship 
with odd function symmetry (Taub and Schilling [28]). Devices with odd
function  symmetry  produce  signals  with  equal  positive  and  negative 
excursions  which  may  lead  to  a  smoothing  filter  output  of  zero.  Note 
that  the  half  wave  square  law  device  of  Equation  B.5  consists  of  an  even 
function $y^2(t)/2$ and an odd function $y(t)|y(t)|/2$. Since y(t) is a
narrowband signal, contributions from the odd function can be eliminated
by the smoothing filter. Thus, a smoothed version of the square law
device output $w_1(t)$ differs only by a factor of two from a smoothed
version of the half wave square law device output $w_4(t)$.
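
The memoryless nonlinearities above, and the factor-of-two relationship just stated, can be checked with a short sketch. It assumes a single sinusoid as a stand-in for the narrowband signal y(t) and uses an average over one full period as a crude substitute for the smoothing filter; the function names are hypothetical.

```python
import math


def square_law(y):            # square law device
    return y * y


def full_wave(y):             # full wave piecewise linear device, Eq. B.3
    return abs(y)


def half_wave(y):             # half wave piecewise linear device, Eq. B.4
    return y / 2 + abs(y) / 2


def half_wave_square_law(y):  # half wave square law device, Eq. B.5
    return (y * y + y * abs(y)) / 2


# One period of a unit-amplitude sinusoid stands in for a narrowband y(t).
N = 1000
y = [math.sin(2 * math.pi * k / N) for k in range(N)]

# Averaging over a whole period is a crude stand-in for the smoothing filter.
mean_w1 = sum(square_law(v) for v in y) / N
mean_w4 = sum(half_wave_square_law(v) for v in y) / N
ratio = mean_w1 / mean_w4
```

The odd term y(t)|y(t)| averages to zero over a period, which is why only the even term survives the smoothing and the ratio of the two averages comes out as two.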


B.2.3  SMOOTHING  FILTERS 


The smoothing filter can be implemented as a LTI lowpass filter with
bandwidth $\Omega_s$. The smoothing filter impulse response $h_s(t)$ is not
necessarily the same as the window function h(t).

In  many  applications  it  is  desirable  to  use  a  F/D  whose  output  is 
always  positive.  For  example,  a  F/D  using  a  square  law  device  may  be 
used to measure average power spectra (Flanagan [1]), and a F/D with a
half  wave  square  law  device  can  be  used  to  model  auditory  nerve  firing 
patterns  (Siebert  [18]).  Since  negative  power  spectra  and  negative 
firing  rates  are  meaningless,  a  positive  F/D  output  is  required.  Also, 
the  F/D  is  often  followed  by  a  square  root  device  (Sondhi,  Schmidt,  and 
Rabiner  [54])  or  a  logarithmic  amplifier  (Searle  [43]).  A  positive  F/D 
output  is  clearly  required  in  such  cases.  Unless  otherwise  stated,  a 
positive  F/D  output  will  be  assumed. 

The  requirement  for  positive  F/D  output  may  place  a  restriction  on 
the  smoothing  filter  design.  Assume  the  memoryless  nonlinearity  output 
is always positive. From the F/D subsystem shown in Fig. B.1, it follows
that the smoothing filter must produce a positive output v(t) in response
to a positive input w(t). Since any LTI filter with positive impulse
response will produce a positive output given a positive input, the
restrictions $0<w(t)$ and $0<h_s(t)$ are sufficient to ensure that $0<v(t)$ for
all t. Although these restrictions are not always necessary (a
counter-example is given in Section 2.4.2.2) they are practical design
guidelines  which  conveniently  guarantee  a  positive  F/D  output. 


In certain cases it is easily shown that a smoothing filter with
positive impulse response is necessary, as well as sufficient, to
guarantee a positive F/D output. For example, assume the bandpass filter
has no spectral zeros and the memoryless nonlinearity is a full wave
piecewise linear device. Choosing x(t) so that the product of its
Fourier transform and the bandpass filter transfer function is unity
leads to an impulse at the bandpass filter output, $y(t)=\delta(t)$. An impulse
also appears at the smoothing filter input, $w(t)=\delta(t)$. Since the
resulting subsystem output v(t) must be positive, the smoothing filter
must have a positive impulse response.

An  ideal  smoothing  filter  is  a  LTI  filter  having  positive  impulse 
response  and  constant  magnitude  across  its  lowpass  bandwidth.  Although 
the  ideal  smoothing  filter  is  a  useful  concept,  it  can  be  shown  that  such 
a filter does not exist (Siebert [55]). When $h_s(t)>0$, $|FT\{h_s(t)\}|$
evaluated at the frequency $\Omega=0$ is strictly greater than $|FT\{h_s(t)\}|$
evaluated at any other frequency $\Omega \neq 0$.

Despite  the  absence  of  an  ideal  smoothing  filter,  a  variety  of 
practical  smoothing  filter  designs  are  possible.  For  example, 

$$h_s(t) = [\sin^2(\Omega_s t/2)]/(\pi t)^2 \tag{B.6}$$

has a Fourier transform which is zero for $|\Omega|>\Omega_s$. Another design is the
causal  filter 

h8(t)  -  BtV®*,  0<t 


0,  otherwise, 


(B.7) 


where $\alpha$ and $\beta$ are positive real constants. This filter is discussed
further in Chapter 2. Channel vocoders sometimes use Bessel filters which
have a small negative overshoot in the impulse response (Sondhi, et al.
[54]). To maintain an overall positive impulse response, a small positive
offset must be added to the Bessel filter impulse response. When a
finite duration impulse response is required, a function such as the
Hamming window may be used to truncate the impulse responses of Equations
B.6 or B.7 (Rabiner and Gold [56]). Alternatively, since a Hamming window
is the impulse response of a lowpass filter and is always positive, it
may be directly used as a smoothing filter.
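
As an illustration of the last point, the sketch below uses a sampled Hamming window as a discrete-time smoothing filter. It is a hedged example rather than the implementation used in this work: the window length, the test input, and the helper names (`hamming`, `fir_filter`, `dtft_magnitude`) are assumptions chosen for the demonstration, and the discrete-time setting is a stand-in for the continuous-time filters discussed above.

```python
import cmath
import math

def hamming(length):
    """Hamming window samples, used here as a positive FIR impulse response."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def fir_filter(h, w):
    """Direct convolution v(n) = sum_m h(m) w(n-m), with w(n) = 0 for n < 0."""
    return [sum(h[m] * w[n - m] for m in range(len(h)) if n - m >= 0)
            for n in range(len(w))]

def dtft_magnitude(h, omega):
    """|sum_n h(n) exp(-j omega n)| at one frequency (radians per sample)."""
    return abs(sum(hn * cmath.exp(-1j * omega * n) for n, hn in enumerate(h)))

h_s = hamming(31)                                        # assumed length
w = [0.1 + abs(math.sin(0.9 * n)) for n in range(200)]   # positive detector output
v = fir_filter(h_s, w)                                   # smoothed, stays positive
dc_gain = dtft_magnitude(h_s, 0.0)
```

Because every tap of `h_s` is positive, every sample of `v` is positive for this positive input, and the magnitude response at zero frequency exceeds the response at the other frequencies sampled, in line with the property stated for positive impulse responses.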


B.3  CONTINUOUS-TIME  FILTER/DETECTOR  RESPONSES 

In  this  section,  responses  of  several  F/D  subsystems  to  a  variety  of 
signals  are  examined  in  detail.  Three  commonly  used  continuous-time  F/D 
subsystems  which  differ  in  memoryless  nonlinearity  type  and  smoothing 
filter  bandwidth  are  shown  in  Fig.  B.3.  Smoothing  filter  bandwidths  for 
the  square  law,  full  wave  piecewise  linear,  and  half  wave  piecewise 
linear detectors are $\Omega_{s1}$, $\Omega_{s2}$, and $\Omega_{s3}$, respectively. For convenience,
all  three  LTI  smoothing  filters  are  assumed  to  have  the  ideal 
characteristics  of  unity  gain,  zero  delay,  and  positive  output  given  a 
positive  input. 

Fig.  B.3a  depicts  a  F/D  subsystem  using  a  square  law  device  in  the 
detector.  A  square  root  device  is  present  so  that  output  levels  are  the 
same  order  of  magnitude  as  those  given  by  detectors  using  full  wave  or 
half  wave  piecewise  linear  devices.  If  the  F/D  outputs  are  followed  by  a 
logarithmic  amplifier,  as  is  often  the  case  in  practice,  then  power  law 
devices  at  the  output  have  little  effect  on  the  final  result. 

Fig.  B.3b  depicts  a  detector  using  a  full  wave  piecewise  linear 
device,  which  is  drawn  as  a  square  law  device  followed  by  a  square  root 
device.  The  half  wave  piecewise  linear  device  of  Fig.  B.3c  is  represented 
by  a  diode  symbol. 


Figure B.3: Commonly Used Filter/Detector Subsystems. (a) Square Law Device. (b) Full Wave Piecewise Linear Device. (c) Half Wave Piecewise Linear Device.


B.3.1  SQUARE  LAW  DETECTOR  RESPONSE  TO  ARBITRARY  INPUTS 


For any arbitrary input signal x(t), the spectrum of y(t) is
bandlimited to the region $\Omega_c-\Omega_h < |\Omega| < \Omega_c+\Omega_h$ as shown in Fig. B.4a. Note
that the graphs of Fig. B.4 do not represent the exact Fourier transform
of any particular signal, but indicate regions where non-negligible
spectral components may exist. From the modulation property of Fourier
transforms (see Appendix A), it follows that the spectrum of $w_1(t)$
consists of low and high frequency regions as shown in Fig. B.4b. If the
smoothing filter bandwidth is chosen so that $2\Omega_h < \Omega_{s1} < 2\Omega_c-2\Omega_h$, then no low
frequency information is lost but all high frequency components are
eliminated from $v_0(t)$.


B.3.2  FULL  AND  HALF  WAVE  PIECEWISE  LINEAR  DETECTOR  RESPONSES  TO 
ARBITRARY  INPUTS 

In  this  section,  it  is  shown  that  the  full  wave  piecewise  linear 
detector output $w_2(t)=|y(t)|$ can be expanded in terms of even powers of
y(t).  The  spectrum  of  |y(t)|  can  therefore  be  determined  from  the 
spectrum  of  y(t)  by  repeated  application  of  the  modulation  property.  The 
result  is  a  new  Fourier  transform  operation  which,  given  the  spectrum  of 
a  signal,  determines  the  spectrum  of  the  absolute  value  of  the  signal.  It 
follows  from  Equation  B.4  that  a  similar  result  may  be  applied  to  the 
half  wave  piecewise  linear  detector. 

The input-output characteristic of a full wave piecewise linear
device is given by:

$$w_2(t) = |y(t)|. \tag{B.8}$$

Since  the  device  is  memoryless,  time  dependence  of  the  signals  is 
unimportant  and  the  time  parameter  may  be  suppressed.  Equation  B.8  can 
thus  be  written  as: 

$$w_2 = |y|. \tag{B.9}$$

Assume the device input amplitude is limited to some arbitrary
finite range $-R \le y \le R$. Using the Fourier series expansion for a triangle
wave, which is identical to the input-output characteristic over the
specified range,

$$w_2 = (R/2) - (4R/\pi^2) \sum_{n=1}^{\infty} (2n-1)^{-2} \cos[(2n-1)\pi y/R]. \tag{B.10}$$

The cosine function can be expanded via a power series for any y:

$$\cos[(2n-1)\pi y/R] = \sum_{m=0}^{\infty} (-1)^m [(2n-1)\pi y/R]^{2m}/(2m)!. \tag{B.11}$$

Substitution of Equation B.11 into B.10 yields:

$$w_2 = \sum_{m=1}^{\infty} a_m y^{2m} \tag{B.12}$$

where

$$a_m = -(4R/\pi^2) \left( (-1)^m (\pi/R)^{2m}/(2m)! \right) \left[ \sum_{n=1}^{\infty} (2n-1)^{2m-2} \right]. \tag{B.13}$$

The full wave piecewise linear device output $w_2(t)$ can thus be expressed
in terms of even powers of its input y(t). Note, however, that the
coefficients $a_m$ have infinite values.

If the number of terms in the series expansion is limited,
$m=1,2,\ldots,M$, an appropriate set of finite values for $a_m$ can be obtained
from truncated versions of the Fourier series of Equation B.10 and the
power series of Equation B.11. The Fourier series converges rapidly, and
relatively few terms are required to obtain results within a specified
accuracy. Each of the cosine terms in the truncated Fourier series is in
turn expanded by the slowly converging power series. The cosine
expansions must contain enough terms so that error given by truncation of
the original Fourier series is not significantly increased. A large
value of M is thus required to obtain a reasonably accurate
approximation. Any constant term in the resulting expansion should be
eliminated so that the approximation produces zero output in response to
zero input.
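
The convergence of the truncated Fourier series of Equation B.10 can be examined with a short numerical sketch. The function names and the choice of evaluation grid below are assumptions made for illustration only.

```python
import math

def triangle_series(y, R=1.0, terms=50):
    """Truncated Equation B.10: approximates |y| for -R <= y <= R."""
    total = R / 2
    for n in range(1, terms + 1):
        k = 2 * n - 1
        total -= (4 * R / math.pi ** 2) * math.cos(k * math.pi * y / R) / k ** 2
    return total

def max_error(R=1.0, terms=50, points=201):
    """Largest deviation from |y| on a uniform grid over [-R, R]."""
    worst = 0.0
    for i in range(points):
        y = -R + 2 * R * i / (points - 1)
        worst = max(worst, abs(triangle_series(y, R, terms) - abs(y)))
    return worst
```

With a finite number of terms, `triangle_series(0.0)` is small but not exactly zero, which is the residual constant term that the text recommends removing so that zero input gives zero output.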

Appropriate coefficient values can also be computed using a minimum
mean squared error (MMSE) criterion. The mean squared approximation
error is given by:

$$e_M = \int_{-R}^{+R} \left[ |y| - \sum_{m=1}^{M} a_m y^{2m} \right]^2 dy. \tag{B.14}$$

To obtain the value of any particular coefficient $a_i$ which minimizes the
error for $i=1,2,\ldots,M$:

$$de_M/da_i = 0 = -4 \int_{0}^{R} \left( y - \sum_{m=1}^{M} a_m y^{2m} \right) y^{2i} \, dy. \tag{B.15}$$

The solution is given by

$$1/(i+1) = \sum_{m=1}^{M} a_m R^{2m-1}/(m+i+.5), \tag{B.16}$$

which generates M equations in M unknowns and thereby specifies $a_m$ for
$m=1,2,\ldots,M$.

As an example, let $R=1$ and $M=7$. Solving Equation B.16 yields:

$$w_2(t) = 1.6746y^2(t) - .078942y^4(t) - .28032y^6(t) - .72214y^8(t) + .024750y^{10}(t) - .0013560y^{12}(t) - .0088671y^{14}(t), \tag{B.17}$$

which is the MMSE approximation for $w_2(t)=|y(t)|$ on the interval
$-1<y(t)<1$ when seven terms are used. Evaluation of Equation B.17 with
$y(t)=1$ yields $w_2(t)=.608$, which is a poor approximation. Repeating the
procedure with $M=10$ yields:

$$w_2(t) \approx 5.8239y^2(t) - 34.0175y^4(t) + 108.3705y^6(t) - 156.0335y^8(t) + 74.8383y^{10}(t) - 16.8961y^{12}(t) + 115.5607y^{14}(t) - 127.7208y^{16}(t) + 7.3678y^{18}(t) + 23.7314y^{20}(t). \tag{B.18}$$

Evaluation of Equation B.18 with $y(t)=1$ yields $w_2(t)=1.025$, which is a
better approximation.

Note that a large number of terms must be used in order to obtain
reasonable results. Thus, to determine the spectrum of |y(t)| from the
spectrum  of  y(t),  the  modulation  property  of  Fourier  transforms  must  be 
applied  many  times.  Since  the  coefficients  must  be  accurate  to  many 
significant  digits  and  a  high  degree  of  precision  must  be  maintained  in 
all  computations,  this  approach  is  mainly  of  theoretical  interest  and  has 
limited  practical  value. 
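
The linear system of Equation B.16 can be set up and solved in a few lines. The sketch below assumes R=1 and uses exact rational arithmetic (Python's `fractions` module) as one way of sidestepping the precision concern raised above; this choice, and the function names `mmse_coefficients` and `evaluate`, are assumptions of the example and not the procedure used to obtain Equations B.17 and B.18. No claim is made that it reproduces those printed coefficients.

```python
from fractions import Fraction

def mmse_coefficients(M):
    """Solve Equation B.16 with R = 1 for a_1..a_M using exact rationals.

    Row i (i = 1..M): sum_m a_m / (m + i + 1/2) = 1 / (i + 1).
    """
    half = Fraction(1, 2)
    rows = [[1 / (Fraction(m + i) + half) for m in range(1, M + 1)]
            + [Fraction(1, i + 1)]
            for i in range(1, M + 1)]
    # Gauss-Jordan elimination on the augmented matrix.
    for col in range(M):
        pivot = next(r for r in range(col, M) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        scale = rows[col][col]
        rows[col] = [v / scale for v in rows[col]]
        for r in range(M):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[m][M] for m in range(M)]

def evaluate(coeffs, y):
    """Evaluate sum_m a_m y^(2m) for the fitted coefficients."""
    return sum(a * y ** (2 * (m + 1)) for m, a in enumerate(coeffs))
```

For example, `mmse_coefficients(1)` gives the single coefficient 5/4, which can be confirmed by hand from Equation B.16 with M=1, and `evaluate(mmse_coefficients(M), y)` can then be compared against |y| for any chosen M.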


B.3.3  RELATIONSHIP  BETWEEN  FULL  WAVE  AND  HALF  WAVE  PIECEWISE  LINEAR 
DETECTORS  FOR  ARBITRARY  INPUTS 

Under  certain  conditions,  outputs  from  F/D  subsystems  using  either 
full  wave  or  half  wave  piecewise  linear  devices  are  the  same  (within  a 
scale  factor)  for  any  arbitrary  input  signal.  From  Equations  B.3  and  B.4 
it  follows  that: 


$$w_3(t) = [y(t)/2] + [w_2(t)/2]. \tag{B.19}$$

The spectrum of y(t) lies in the region $\Omega_c-\Omega_h < |\Omega| < \Omega_c+\Omega_h$ as shown in
Fig. B.4a. If the smoothing filter bandwidth is chosen so that
$0 < \Omega_{s3} < \Omega_c-\Omega_h$, then the bandpass component y(t)/2 is eliminated. Setting
$\Omega_{s2}=\Omega_{s3}$ then results in F/D outputs $v_2(t)$ and $v_3(t)$ which differ by a
scale factor of two.

B.3.4  RELATIONSHIP  BETWEEN  SQUARE  LAW  AND  FULL  WAVE  PIECEWISE  LINEAR 
DETECTORS  FOR  ARBITRARY  INPUTS 

The square law detector shown in Fig. B.3a lowpass filters the
waveform $w_1(t)$ and takes the square root of the result to obtain output
$v_1(t)$. The full wave piecewise linear detector of Fig. B.3b takes the
square root of $w_1(t)$ and lowpass filters the result to obtain output
$v_2(t)$. Since lowpass filter and square root operations are not
interchangeable, the outputs $v_1(t)$ and $v_2(t)$ are not equal in general. It
will be shown, however, that given certain restrictions these two outputs
are similar for a variety of different input waveforms x(t). Note that
while $v_2(t)$ is a bandlimited signal, $v_1(t)$ is not necessarily
bandlimited. Therefore, a large smoothing filter bandwidth $\Omega_{s2}$ may be
required in order that $v_2(t)=v_1(t)$.


B.3.5 NOISE RESPONSE


Let the input to each F/D subsystem of Fig. B.3, x(t), be a white
Gaussian noise process with variance $S_x(\Omega)=4A$. If the window function
h(t) is the impulse response of a unity gain ideal LTI lowpass filter
with cutoff frequency $\Omega_h$, the bandpass filter output y(t) will be a
bandlimited Gaussian noise process with power spectral density

$$S_y(\Omega) = S_x(\Omega) |FT\{h(t)\sin(\Omega_c t)\}|^2 = \begin{cases} A, & \Omega_c-\Omega_h < |\Omega| < \Omega_c+\Omega_h \\ 0, & \text{otherwise.} \end{cases} \tag{B.21}$$

Note that the spectral height of $S_y(\Omega)$ equals the input variance reduced
by a factor of four, as shown in Fig. B.5a. Power spectral densities for
noise processes at the output of each memoryless nonlinearity are shown
in Figs. B.5b, c, and d (Davenport and Root [57]; Papoulis [58]).

Given certain restrictions, comparable F/D noise responses can be
obtained over a specified range of frequencies. To avoid loss of low
frequency information while eliminating high frequency components, let
$2\Omega_h < \Omega_{s1}, \Omega_{s2} < 2\Omega_c-2\Omega_h$ for the square law and full wave piecewise linear
detectors while $2\Omega_h < \Omega_{s3} < \Omega_c-\Omega_h$ for the half wave piecewise linear
detector. Under these restrictions, the full wave and half wave
piecewise linear detector outputs will differ only by a scale factor.
Note that the minimum smoothing filter bandwidth, $2\Omega_h$ in all cases, is
twice the bandwidth normally used for detection of Amplitude Modulation
(AM) signals (Siebert [55]). The wide bandwidth is required because the
input signal is not generally AM in nature, and has neither the carrier
nor the symmetry inherent in AM signals.

Figure B.5 (b): Square Law Device Output Power Spectrum

Given  the  stated  restrictions,  noise  responses  of  F/D  subsystems 
using  square  law  and  full  wave  piecewise  linear  devices  are  similar  in 
many ways. From Figs. B.5b and B.5c it is apparent that the noise
processes $w_1(t)$ and $w_2(t)$, and therefore $v_0(t)$ and $v_2(t)$, have comparable
power spectral density shapes. However, total area under the square law
device power spectral density curve is proportional to the square of the
input variance, while the area under the curve for the full wave
piecewise linear device is directly proportional to the input variance.
Due to the square root device shown in Fig. B.3a, the zero frequency
component of $v_1(t)$ is proportional to the input variance. The zero
frequency component of $v_2(t)$ is also proportional to the input variance.
Thus,  in  applications  where  a  F/D  subsystem  is  used  to  measure  noise 
process  characteristics,  the  zero  frequency  component  of  the  F/D  output 
is  often  the  only  quantity  of  interest  (see  Sections  D.2  and  D.5  for 
such  applications). 


Now consider the F/D subsystem of Fig. B.3b which uses a full wave
piecewise linear device. If the input is an impulse then, by Fourier
series expansion (Spiegel [59]),

$$w_2(t) = |h(t)| \, |\sin(\Omega_c t)| = |h(t)| \left\{ (2/\pi) + (4/\pi) \sum_{n=1}^{\infty} [\cos(2n\Omega_c t)]/(1-4n^2) \right\}. \tag{B.25}$$

Frequency regions occupied by $FT\{w_2(t)\}$ are shown in Fig. B.6b. If the
lowpass smoothing filter has cutoff frequency $\Omega_{s2}$ such that
$\Omega_{|h|} < \Omega_{s2} < 2\Omega_c-\Omega_{|h|}$, then the F/D output is:

$$v_2(t) = 2|h(t)|/\pi. \tag{B.26}$$

Finally, consider the F/D subsystem of Fig. B.3c which uses a half
wave piecewise linear device. If the input is an impulse, then

$$w_3(t) = [y(t)/2] + [|y(t)|/2] = [h(t)\sin(\Omega_c t)]/2 + |h(t)| \left\{ (1/\pi) + (2/\pi) \sum_{n=1}^{\infty} [\cos(2n\Omega_c t)]/(1-4n^2) \right\}. \tag{B.27}$$

Frequency regions occupied by $FT\{w_3(t)\}$ are shown in Fig. B.6c. Let the
lowpass smoothing filter have cutoff frequency $\Omega_{s3}$ where $\Omega_{s3} < \Omega_c-\Omega_h$ and
$\Omega_{|h|} < \Omega_{s3}$. The half wave piecewise linear detector output is then
proportional to the full wave piecewise linear detector output.


Under  the  stated  restrictions  and  assumptions,  the  impulse  responses  of 
all  three  F/D  subsystems  differ  only  by  a  scale  factor. 

The  results  of  this  section  can  be  applied  to  determine  the  impulse 
response  of  a  F/D  subsystem  using  a  half  wave  square  law  device,  as  used 
in  Chapter  2.  It  follows  from  Equations  B.2,  B.3,  B.5,  and  the  modulation 
property  that 

$$FT\{w_4(t)\} = [FT\{w_1(t)\}]/2 + [FT\{y(t)\} * FT\{w_2(t)\}]/4\pi. \tag{B.29}$$

Spectral regions occupied by $[FT\{w_1(t)\}]/2$ are shown in Fig. B.6a. The
lowest frequency spectral region occupied by $[FT\{y(t)\} * FT\{w_2(t)\}]/4\pi$ is
$\Omega_c-\Omega_{|h|}-\Omega_h < |\Omega| < \Omega_c+\Omega_{|h|}+\Omega_h$, as can be seen by convolving Figs. B.4a and
B.6b. Thus if a smoothing filter with bandwidth $\Omega_{s4}$ is chosen such that
$2\Omega_h < \Omega_{s4} < \Omega_c-\Omega_{|h|}-\Omega_h$ and $2\Omega_h < \Omega_{s4} < 2\Omega_c-2\Omega_h$, the half wave square law F/D
output is $h^2(t)/4$.


B.3.7  SINUSOIDAL  RESPONSE 


Let  the  input  to  each  F/D  subsystem,  x(t),  be  a  sinusoidal  waveform. 
Because  the  bandpass  filter  is  LTI,  the  filter  output  y(t)  is  also 
sinusoidal.  The  waveform  will  be  changed  in  amplitude  and  phase  if  the 
bandpass  filter  is  non-ideal.  Assume 


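$$y(t) = A_1 \sin(\Omega_1 t), \tag{B.30}$$

where $\Omega_c-\Omega_h < \Omega_1 < \Omega_c+\Omega_h$. Then

$$w_1(t) = (A_1)^2 (1-\cos 2\Omega_1 t)/2. \tag{B.31}$$

For $0 < \Omega_{s1} < 2\Omega_c-2\Omega_h$,

$$v_0(t) = (A_1)^2/2 \tag{B.32}$$

$$v_1(t) = |A_1|/\sqrt{2}. \tag{B.33}$$

From the Fourier series expansion of Equation B.25, it follows that

$$w_2(t) = |A_1| \left\{ (2/\pi) + (4/\pi) \sum_{n=1}^{\infty} [\cos(2n\Omega_1 t)]/(1-4n^2) \right\} \tag{B.34}$$

and, for $0 < \Omega_{s2} < 2\Omega_c-2\Omega_h$,

$$v_2(t) = 2|A_1|/\pi. \tag{B.35}$$

Similarly,

$$w_3(t) = (A_1 \sin \Omega_1 t)/2 + |A_1| \left\{ (1/\pi) + (2/\pi) \sum_{n=1}^{\infty} [\cos(2n\Omega_1 t)]/(1-4n^2) \right\}. \tag{B.36}$$

Thus, for $0 < \Omega_{s3} < \Omega_c-\Omega_h$,

$$v_3(t) = |A_1|/\pi. \tag{B.37}$$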

Under  the  stated  restrictions  and  assumptions,  the  sinusoidal  responses 
of  all  three  F/D  subsystems  differ  only  by  a  scale  factor. 
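
The three sinusoidal responses can be confirmed numerically. The sketch below is illustrative only: it assumes that the ideal smoothing filter can be modelled as an average over one period of the sinusoid (which retains only the zero frequency component), and the function name `detector_outputs` is hypothetical.

```python
import math

def detector_outputs(amplitude, samples=100000):
    """Idealized F/D outputs for y(t) = A1 sin(Omega1 t).

    Assumption: the ideal smoothing filter is modelled as an average over
    one period, which keeps only the zero frequency component.
    """
    ys = [amplitude * math.sin(2 * math.pi * n / samples)
          for n in range(samples)]
    v1 = math.sqrt(sum(y * y for y in ys) / samples)    # square law, then root
    v2 = sum(abs(y) for y in ys) / samples              # full wave
    v3 = sum((y + abs(y)) / 2 for y in ys) / samples    # half wave
    return v1, v2, v3

v1, v2, v3 = detector_outputs(3.0)
```

With an amplitude of 3, the three values agree closely with 3/sqrt(2), 6/pi, and 3/pi respectively, so the outputs differ only by constant scale factors, as stated.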


B.3.8 SINUSOIDAL PAIR RESPONSE

Let x(t) be a sinusoidal pair. Because the bandpass filter is LTI,
y(t) is also a sinusoidal pair. Assume

$$y(t) = A_2[\cos(\Omega_1 t) - \cos(\Omega_2 t)] \tag{B.38}$$

where $\Omega_c-\Omega_h < \Omega_1, \Omega_2 < \Omega_c+\Omega_h$ and $\Omega_2 \neq \Omega_1$. Thus

$$w_1(t) = (A_2)^2 [1 - \cos(\Omega_1-\Omega_2)t - \cos(\Omega_1+\Omega_2)t + (\cos 2\Omega_1 t + \cos 2\Omega_2 t)/2]. \tag{B.39}$$

Let $2\Omega_h < \Omega_{s1} < 2\Omega_c-2\Omega_h$. Then

$$v_0(t) = (A_2)^2 [1 - \cos(\Omega_1-\Omega_2)t] \tag{B.40}$$

and

$$v_1(t) = \sqrt{2}\,|A_2 \sin[(\Omega_1-\Omega_2)t/2]|. \tag{B.41}$$

The waveform $v_1(t)$ of Equation B.41 is not strictly bandlimited.
However, an effective bandwidth $\Omega_e$ may be chosen such that, for practical
purposes, frequency components of $v_1(t)$ in the region $\Omega_e < |\Omega|$ are
negligible. Since


B.4 CONCLUSION


In  this  appendix,  F/D  subsystem  components  have  been  described  and 
three  common  F/D  subsystems  have  been  investigated  in  detail.  For 
arbitrary  input  signals,  the  response  of  a  F/D  using  a  square  law  device 
is  easily  determined.  Responses  of  detectors  using  full  and  half  wave 
piecewise  linear  devices  are  not  easily  determined  in  general.  It  has 
been  shown  that,  under  certain  conditions,  the  outputs  of  detectors  using 
full  and  half  wave  piecewise  linear  devices  differ  only  by  a  scale 
factor.  It  was  also  shown  that  a  F/D  subsystem  using  a  square  law 
detector  can  be  turned  into  a  F/D  subsystem  using  a  full  wave  piecewise 
linear  device  by  interchanging  square  root  and  lowpass  filter  operations 
(see  Fig.  B.3).  Thus,  the  outputs  of  these  subsystems  are  not  the  same 
in  general.  Under  certain  restrictive  conditions,  however,  the 
subsystems  have  similar  responses  to  noise,  impulse,  sinusoid,  and 
sinusoidal  pair  inputs.  These  results  will  be  used  in  Section  D.3  to 
relate  spectrograms  with  the  spectrogram-like  representations  generated 
from  Short-Time  Fourier  Transform  magnitude. 

APPENDIX C


GENERALIZED  SHORT-TIME  FOURIER  TRANSFORM  COMPUTATION 


C.l  GSTFT  ANALYSIS  USING  FIR  WINDOWS 

Assume  that  each  window  function  is  the  Finite-duration  Impulse 
Response  (FIR)  of  a  lowpass  filter.  Let  the  set  of  window  functions 
$h_k(n)$ be zero outside the range $0 \le n \le M_k-1$. Note that each window function
may have a different duration $M_k$. Window functions can be defined by an
equation,  as  for  a  Hamming  window,  or  values  may  simply  be  defined  on  a 
point  by  point  basis. 


When  a  FIR  window  is  used,  the  GSTFT  magnitude  squared  is  given  by: 


$$|X_n(e^{j\omega_k})|^2 = \left[ \sum_{m=0}^{M_k-1} x(n-m) h_k(m) \cos(\omega_k m) \right]^2 + \left[ \sum_{m=1}^{M_k-1} x(n-m) h_k(m) \sin(\omega_k m) \right]^2, \tag{C.1}$$

where $k=1,2,\ldots,K$. One of the set of K F/D subsystems is shown in Fig.
C.1. In this figure, unit delays are denoted by $z^{-1}$ and amplifier
symbols (triangles) indicate multiplication by a constant. The F/D of
Fig. C.1 is a discrete-time version of Fig. 2.7b with the filters drawn
in detail to show their FIR structure.

The F/D implementation shown in Fig. C.1 (or 2.7b) is of special
interest when the data, x(n), has been quantized to one bit. Such a
situation arises when speech data has been encoded using linear Delta
Modulation (Steele [60]). In this case the bandpass filters can be
implemented without use of multiplication; i.e., multiplication by zero or
one is trivial. Such a structure is therefore suitable for real-time
speech  analysis  systems  implemented  with  microcomputers  (Anderson  [61]). 
Note  that  the  same  computational  efficiency  is  not  achieved  by  the  system 
of  Fig.  2.7a  where  the  data  is  modulated  prior  to  filtering. 
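
Equation C.1 translates directly into code. The following is a minimal sketch for a single channel; it assumes that x(n) is zero for negative n and that the GSTFT has the usual form of a sum over m of x(m) h_k(n-m) exp(-j w_k m) (Equation 2.29 is not reproduced in this appendix, so that form is an assumption). The helper names are illustrative.

```python
import cmath
import math

def hamming(length):
    """Example FIR window; any lowpass window h_k(m) could be used instead."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * m / (length - 1))
            for m in range(length)]

def gstft_mag_squared(x, n, h_k, omega_k):
    """Equation C.1 for one channel k at time n (x is zero outside its range)."""
    def sample(i):
        return x[i] if 0 <= i < len(x) else 0.0
    cos_sum = sum(sample(n - m) * h_k[m] * math.cos(omega_k * m)
                  for m in range(len(h_k)))
    # The sine sum starts at m = 1 because sin(0) = 0.
    sin_sum = sum(sample(n - m) * h_k[m] * math.sin(omega_k * m)
                  for m in range(1, len(h_k)))
    return cos_sum ** 2 + sin_sum ** 2

def stft_mag_squared_direct(x, n, h_k, omega_k):
    """Reference: |sum_m x(m) h_k(n-m) exp(-j omega_k m)|^2."""
    total = 0j
    for m in range(len(x)):
        if 0 <= n - m < len(h_k):
            total += x[m] * h_k[n - m] * cmath.exp(-1j * omega_k * m)
    return abs(total) ** 2
```

Each channel may be given its own window length, for example `gstft_mag_squared(x, n, hamming(16), 0.3)` for one channel and `gstft_mag_squared(x, n, hamming(32), 1.1)` for another, which reflects the nonuniform durations allowed above.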


C.2 GSTFT ANALYSIS USING IIR WINDOWS


Assume  that  each  window  function  is  the  Infinite-duration  Impulse 
Response  (IIR)  of  a  lowpass  filter.  Let  the  set  of  window  functions 
$h_k(n)$ be zero for $n<i$, where i is an arbitrary integer constant. For
$n \ge i$,

$$h_k(n) = \sum_{\psi=1}^{\Psi_k} p_k(\psi) h_k(n-\psi) + \sum_{r=i}^{R_k} q_k(r) \delta(n-r) \tag{C.2}$$

where the set of coefficients $p_k$ and $q_k$ are real constants with $q_k(i) \neq 0$.
From Equation C.2 it can be seen that the window functions are
right-sided sequences (Oppenheim and Schafer [31]) with first nonzero
value $h_k(i)=q_k(i)$. This choice for relative time alignment of the window
functions,  although  arbitrary,  serves  to  simplify  the  synthesis  equations 
of  Chapter  3. 

Each  IIR  window  function  described  by  Equation  C.2  has  a  rational 
z-transform  (see  Appendix  A): 


$$H_k(z) = \frac{\sum_{r=i}^{R_k} q_k(r) z^{-r}}{1 - \sum_{\psi=1}^{\Psi_k} p_k(\psi) z^{-\psi}}. \tag{C.3}$$

The window function spectral zeros can be determined by factoring a
polynomial involving the set of $q_k$ coefficients, and poles are similarly
obtained from the $p_k$ coefficients.


Since a FIR filter has zeros but no poles in its system function,
the IIR window function includes the FIR analysis window as a special
case. To eliminate the poles, let $p_k(\psi)=0$ for all values of k and $\psi$.
For convenience let $i=0$ and $R_k=M_k-1$. Equation C.2 then becomes

$$h_k(n) = \sum_{r=0}^{M_k-1} q_k(r) \delta(n-r), \tag{C.4}$$

from which it follows that

$$h_k(n) = \begin{cases} q_k(n), & 0 \le n \le M_k-1 \\ 0, & \text{otherwise.} \end{cases} \tag{C.5}$$

The  FIR  window  discussed  in  Section  C.l  is  thus  a  special  case  of  the  IIR 
window  function  defined  by  Equation  C.2. 

The recursive formula for the GSTFT, which results when the IIR
window is substituted into the defining equation for the GSTFT (see
Equation 2.29) is given by:

$$X_n(e^{j\omega_k}) = \sum_{\psi=1}^{\Psi_k} p_k(\psi) X_{n-\psi}(e^{j\omega_k}) + \sum_{r=i}^{R_k} q_k(r) x(n-r) e^{-j\omega_k(n-r)}. \tag{C.6}$$

The recursion of Equation C.6 can be implemented using a variety of
filter configurations (Oppenheim and Schafer [31]; Rabiner and Gold
[56]). For example, a "direct form two" implementation can be obtained
by defining an auxiliary sequence $L_k(n)$, where

$$L_k(n) = x(n) e^{-j\omega_k n} + \sum_{\psi=1}^{\Psi_k} p_k(\psi) L_k(n-\psi). \tag{C.7}$$

The GSTFT is then computed by

$$X_n(e^{j\omega_k}) = \sum_{r=i}^{R_k} q_k(r) L_k(n-r). \tag{C.8}$$
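
The direct form two computation of Equations C.7 and C.8 can be sketched for a single channel as follows. This is an illustrative example under stated assumptions: x(n) is taken to be zero for negative n, the coefficients are passed as plain lists, and the names `gstft_iir` and `gstft_direct` are hypothetical. The second function evaluates the GSTFT directly from the window impulse response so that the recursion can be checked against it.

```python
import cmath

def gstft_iir(x, omega_k, p, q, i=0):
    """Direct form two GSTFT for one channel (Equations C.7 and C.8).

    p[psi - 1] holds p_k(psi) for psi = 1..len(p); q[j] holds q_k(i + j).
    Returns X_n(e^{j omega_k}) for n = 0..len(x)-1, assuming x(n) = 0 for n < 0.
    """
    L = []
    for n, xn in enumerate(x):
        # Equation C.7: modulate the input, then apply the recursive part.
        acc = xn * cmath.exp(-1j * omega_k * n)
        for psi, coeff in enumerate(p, start=1):
            if n - psi >= 0:
                acc += coeff * L[n - psi]
        L.append(acc)
    X = []
    for n in range(len(x)):
        # Equation C.8: non-recursive combination of the auxiliary sequence.
        total = 0j
        for j, coeff in enumerate(q):
            r = i + j
            if n - r >= 0:
                total += coeff * L[n - r]
        X.append(total)
    return X

def gstft_direct(x, omega_k, h):
    """Reference: X_n = sum_m x(m) h(n-m) exp(-j omega_k m) for a causal h."""
    return [sum(x[m] * h[n - m] * cmath.exp(-1j * omega_k * m)
                for m in range(n + 1))
            for n in range(len(x))]
```

As a usage example, `gstft_iir(x, 0.4, p=[0.9], q=[1.0], i=0)` corresponds to the one-pole window h(n) = 0.9^n for n >= 0, and its output can be compared with `gstft_direct` applied to that impulse response.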

The required sine and cosine sequences can also be computed
recursively, if desired, since

$$\cos \omega_k n = (\cos \omega_k)[\cos \omega_k(n-1)] - (\sin \omega_k)[\sin \omega_k(n-1)] \tag{C.9}$$

and

$$\sin \omega_k n = (\sin \omega_k)[\cos \omega_k(n-1)] + (\cos \omega_k)[\sin \omega_k(n-1)]. \tag{C.10}$$

It  should  be  noted  that  the  sine  and  cosine  sequences  computed  via  the 
recursion  may  become  less  accurate  with  increasing  n.  This  problem  can 
be  overcome  by  periodically  resetting  the  recursion  variables  to  their 
correct  values.  Correct  values  for  the  reset  operation  may  be  obtained 
from  a  similar,  but  lower  frequency,  recursion  (Gold  and  Rader  [62]). 
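
A minimal sketch of the recursion of Equations C.9 and C.10, with an optional periodic reset, is given below. As a simplifying assumption the reset values are taken from the math library rather than from the lower frequency recursion cited above; the function name and the `reset_every` parameter are illustrative.

```python
import math

def recursive_sinusoids(omega_k, count, reset_every=None):
    """Generate (cos(omega_k n), sin(omega_k n)) for n = 0..count-1
    using the rotations of Equations C.9 and C.10."""
    c1, s1 = math.cos(omega_k), math.sin(omega_k)
    c, s = 1.0, 0.0
    out = []
    for n in range(count):
        if reset_every and n % reset_every == 0:
            # Assumption: reset values come from the math library, not from
            # the lower frequency recursion mentioned in the text.
            c, s = math.cos(omega_k * n), math.sin(omega_k * n)
        out.append((c, s))
        # Advance one sample: C.9 for the cosine, C.10 for the sine.
        c, s = c1 * c - s1 * s, s1 * c + c1 * s
    return out
```

Only two multiplications and one addition per output value are needed between resets, and choosing a smaller `reset_every` bounds how far rounding error can accumulate.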

Fig.  C.2  depicts  a  F/D  subsystem  using  a  direct  form  two  filter 
implementation  and  recursive  sine  and  cosine  generation.  Parameters  for 
this  subsystem  are  i=l,  and  R^^.  Note  that  the  F/D  of  Fig.  C.2  is 

a discrete-time version of Fig. 2.7a with the filters drawn in detail to
show their IIR structure. This implementation is suitable for real-time
speech analysis systems, and may be used in the system described in
Chapter 2.


Filter/Detector  (F/D)  subsystems  are  used  in  a  variety  of  speech 
analysis  and  synthesis  systems.  By  varying  the  bandpass  filter 
characteristics,  memoryless  nonlinearity  type,  and  smoothing  filter 
cutoff  frequency,  relationships  between  several  speech  processing 
techniques  can  be  examined. 

In this appendix, the new relationship between Short-Time Fourier
Transform (STFT) magnitude squared and F/D subsystems, as derived in
Chapter 2, is used to describe channel vocoder operation. The
relationship is also used to explain similarities between spectrograms
and the spectrogram-like representations generated from STFT magnitude,
give a new F/D interpretation to the FFT magnitude, demonstrate the
equivalence between the discrete-time Welch method of power spectral
estimation and results produced by continuous-time power spectral
estimation methods, and examine several approaches to variable
bandwidth analysis. Digital (discrete-time) as well as analog
(continuous-time) systems will be discussed.


D.2  CHANNEL  VOCODERS 

Channel  vocoders  are  analysis/synthesis  systems  which  model  a  speech 
signal  as  being  either  voiced  (having  a  periodic  pitch)  or  unvoiced 
(noise-like).  The  analyzer  typically  includes  a  voiced/unvoiced  (V/UV) 
decision  subsystem,  a  pitch  extractor  to  determine  the  fundamental 
frequency  of  voiced  signals,  and  a  F/D  bank.  The  synthesizer  contains  a 
pitch  generator,  noise  source,  V/UV  selector  switch,  modulators,  and 
bandpass  filters. 

A channel vocoder analyzer described by Rabiner and Gold [56] uses
sixteen bandpass filters with nonuniform center frequency spacing to
analyze the .3-3 kHz frequency range. The bandwidth for the lowest
frequency  filter  is  125  Hz  while  a  bandwidth  of  400  Hz  is  used  for  the 
highest  frequency  filter.  The  smoothing  filter  bandwidth  is  25  Hz,  and 
is  the  same  for  all  channels  regardless  of  the  bandpass  filter 
characteristics.  Thus,  the  F/D  bank  measures  only  quasi-stationary 
aspects  of  the  speech  signal  such  as  input  signal  variance  (see  Section 
B.3.5). 

Since the smoothing filter has narrow bandwidth, F/D outputs are
similar for a variety of different memoryless nonlinearities. Half wave
and full wave piecewise linear devices are most often used in analog
channel vocoders as they are easily implemented with diodes. A square
law device is often used in digital channel vocoders. The increased
dynamic range requirements are offset by the fact that the square law
device produces bandlimited signals which can be represented digitally
with little aliasing. In certain digital channel vocoder applications the
square law device yields superior results compared to the full wave
linear device (Sondhi, et al. [54]).

The  speech  analysis/synthesis  system  based  on  perception,  described 
in  Chapters  2  and  3,  can  be  viewed  as  a  channel  vocoder  which  does  not 
require  pitch  extraction.  The  data  rate  for  such  a  system,  however,  is 
much  higher  than  that  normally  associated  with  channel  vocoders.  The 
data  rate  can  be  reduced  by  placing  additional  lowpass  filters  at  each 
F/D output. However, it is clear from the results of Fig. 4.56 that high
quality  speech  cannot  be  reconstructed  from  such  lowpass  filtered  F/D 
outputs  alone,  and  additional  information  is  required.  One  method  for 
obtaining  such  information,  which  corresponds  to  a  form  of  pitch 
extraction,  is  described  in  Section  5.3.  Note  that,  contrary  to  comments 
by  Rabiner  and  Gold  [56],  a  channel  vocoder  analyzer  does  not  preserve 
the Short-Time Fourier Transform (STFT) magnitude, but instead preserves
a  lowpass  filtered  version  of  a  generalized  form  of  the  STFT  magnitude. 


D.3 SPECTROGRAMS


The  Sound  Spectrograph  machine  [63]  employs  a  measurement  system 
which  is  similar  to  the  F/D  of  Fig.  B.3b.  The  machine  uses  a  diode 
rectifier  to  implement  the  full  wave  piecewise  linear  device.  Two  types 
of  analysis  can  be  performed.  A  wideband  analysis  uses  a  bank  of 
bandpass  filters  each  having  an  effective  300  Hz  bandwidth,  while  a 
narrowband  analysis  uses  50  Hz  bandwidth  filters.  Filter  center 
frequencies  are  20  Hz  apart  (Flanagan  [1])  and  the  frequency  range  0.05-7
kHz  is  analyzed.  Each  lowpass  smoothing  filter  has  an  effective
bandwidth  of  several  hundred  Hertz,  and  is  sufficiently  wide  to  pass  any 
envelope  frequencies  which  may  be  present  at  the  output  of  a  wideband 
filter  due  to  beating  of  adjacent  harmonic  pitch  components. 

It  was  shown  in  Section  B.3.4  that  since  lowpass  filters  and  square 
root  devices  are  not  interchangeable,  outputs  of  the  F/D  subsystems
depicted  in  Figs.  B.3a  and  B.3b  are  not  the  same  in  general.  However, 
parameters  for  spectrogram  generation  are  such  that  similar  results  may 
be  produced  by  both  F/D  subsystems  for  a  variety  of  input  signals,  as 
shown  in  Section  B.3.  Furthermore,  since  the  STFT  magnitude  can  be  used 
to  implement  the  F/D  of  Fig.  B.3a,  the  STFT  magnitude  can  also  be  used  to 
roughly  simulate  Sound  Spectrograph  machine  operation  (Oppenheim  [64]; 
Wood  and  Oppenheim  [65];  Rabiner  and  Schafer  [3]).  Note  that,  contrary 
to  results  given  by  Flanagan  [1],  the  correct  F/D  approximation  to  STFT 
magnitude  is  shown  in  Fig.  B.3a  and  the  F/D  of  Fig.  B.3b  is  applicable 
only  under  the  restrictive  conditions  discussed  in  Section  B.3. 


D.4  SLIDING  DFT  IMPLEMENTATION  OF  THE  STFT


In  this  section,  it  will  be  shown  that  the  STFT  can  be  computed  by 
performing  the  Discrete  Fourier  Transform  (DFT)  on  segments  of  a  long 
data  sequence.  The  DFT  approach  is  attractive  since  it  can  be 
efficiently  implemented  via  the  Fast  Fourier  Transform  (FFT)  algorithm. 

The DFT is given by (Oppenheim and Schafer [31]):

$$Y(e^{j\omega_k}) = \sum_{m=0}^{M-1} x(m)\,h'(m)\,e^{-j\omega_k m} \tag{D.1}$$

where

$$\omega_k = 2\pi k/M \tag{D.2}$$

and $k = 0, 1, 2, \ldots, M-1$. Note that the analysis frequencies are uniformly spaced. The DFT window function $h'(m)$ is finite length, and is zero outside the range $0 \le m \le M-1$.

A sliding DFT analysis is defined by:

$$Y_n(e^{j\omega_k}) = \sum_{m=-\infty}^{\infty} x(n+m-M+1)\,h'(m)\,e^{-j\omega_k m}. \tag{D.3}$$

Although the summation limits have infinite range, terms in the summation are nonzero only on the interval $0 \le m \le M-1$ due to the finite length DFT window function $h'(m)$.


From Equation D.3 it can be seen that the sliding DFT segments the data sequence $x(n)$ into sections of length $M$ and performs a DFT on each segment. For example, $Y_{M-1}(e^{j\omega_k})$ is the DFT of the first $M$ signal points $x(0), x(1), \ldots, x(M-1)$. Other definitions of the sliding DFT (Oppenheim [64]) use a time index such that the sliding DFT value at $n=0$ is the DFT of the first $M$ signal points. Many variations are possible, but the resulting differences are unimportant and the definition of Equation D.3 is chosen for convenience.
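
The definition of Equation D.3 can be illustrated with a minimal sketch in Python with NumPy. The function name `sliding_dft`, the short window length, and the random test data are assumptions made for illustration only; the sketch simply evaluates the sum of Equation D.3 with an FFT and checks that the value at time n = M-1 is the DFT of the first M windowed signal points.

```python
import numpy as np

def sliding_dft(x, h_prime, n):
    """Sliding DFT of Equation D.3 at time n, for k = 0..M-1.

    x is treated as zero outside its stored range; h_prime is the
    length-M DFT window h'(m).
    """
    M = len(h_prime)
    # Segment x(n-M+1), ..., x(n), zero-filled where indices fall outside x.
    idx = np.arange(n - M + 1, n + 1)
    valid = (idx >= 0) & (idx < len(x))
    seg = np.zeros(M)
    seg[valid] = x[idx[valid]]
    # Sum over m of x(n+m-M+1) h'(m) exp(-j w_k m), evaluated by the FFT.
    return np.fft.fft(seg * h_prime)

M = 8                              # illustrative window length
rng = np.random.default_rng(0)     # arbitrary test data
x = rng.standard_normal(32)
h_prime = np.hamming(M)

# Y_{M-1} should be the DFT of the first M windowed signal points.
Y = sliding_dft(x, h_prime, M - 1)

# Direct evaluation of Equation D.1 for comparison.
direct = np.array([
    sum(x[m] * h_prime[m] * np.exp(-2j * np.pi * k * m / M) for m in range(M))
    for k in range(M)
])
```

Evaluating `sliding_dft` only at n = M-1, 2M-1, ... would correspond to hopping with no overlap.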

The  sliding  DFT  need  not  be  computed  for  every  time  n.  Often,  to 
decrease  computation  time  and  data  storage  requirements,  only  samples  of 
the  sliding  DFT  are  desired.  In  this  case,  it  becomes  the  hopped  DFT 
discussed  by  Rabiner  and  Gold  [56].

To relate the sliding DFT to the STFT, define a new window function $h(m)$ to be a time-reversed and delayed version of $h'(m)$; i.e.,

$$h'(m) = h(M-1-m) \tag{D.4}$$

for all $m$. Since many window functions used in conjunction with the DFT are symmetrical, the time-reversed and delayed window is often the same as the original window. By substitution into Equation D.3:

$$\begin{aligned}
Y_n(e^{j\omega_k}) &= \sum_{m=-\infty}^{\infty} x(n+m-M+1)\,h(M-1-m)\,e^{-j\omega_k m} \\
&= \sum_{m=-\infty}^{\infty} x(n-m)\,h(m)\,e^{-j\omega_k (M-1-m)} \\
&= e^{j\omega_k (n-M+1)}\,X_n(e^{j\omega_k}),
\end{aligned} \tag{D.5}$$

where $X_n(e^{j\omega_k})$ is the discrete-time STFT (a special case of the Generalized STFT) evaluated at a fixed frequency as given by Equation 2.21. This result leads to the following procedure for computing the STFT via the sliding DFT:

1.  Form  a  time-reversed  and  delayed  version  of  the  STFT  window 
function,  h(n),  and  call  it  h'(n). 

2.  Pre-multiply  a  data  segment  by  h'(n). 

3.  Perform  a  DFT  on  the  windowed  segment  by  using  the  FFT  algorithm. 

4.  Post-multiply  the  results  by  a  time-varying  complex  exponential.

The  complex  exponential  post-multiplication  step  converts  sliding 
DFT  outputs,  which  are  bandpass  functions,  into  lowpass  STFT  results.  If 
only magnitudes are computed, then step 4 is unnecessary since:

$$|Y_n(e^{j\omega_k})| = |X_n(e^{j\omega_k})|. \tag{D.6}$$
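
The four-step procedure can be sketched as follows in Python with NumPy. Equation 2.21 is not reproduced in this appendix, so the sketch assumes the conventional form $X_n(e^{j\omega_k}) = \sum_m x(m)\,h(n-m)\,e^{-j\omega_k m}$; under that assumption the sliding DFT and the STFT differ by the factor $e^{j\omega_k(n-M+1)}$. The function names, the deliberately asymmetric random window, and the test data are illustrative assumptions.

```python
import numpy as np

def stft_direct(x, h, n, M):
    """Assumed STFT form: X_n(e^{j w_k}) = sum_m x(m) h(n-m) exp(-j w_k m)."""
    k = np.arange(M)
    X = np.zeros(M, dtype=complex)
    for m in range(len(x)):
        if 0 <= n - m < len(h):
            X += x[m] * h[n - m] * np.exp(-2j * np.pi * k * m / M)
    return X

def stft_via_sliding_dft(x, h, n):
    """Return (Y_n, X_n) for all k, following the four-step procedure."""
    M = len(h)
    h_prime = h[::-1]                        # step 1: h'(m) = h(M-1-m)
    idx = np.arange(n - M + 1, n + 1)        # samples x(n-M+1), ..., x(n)
    valid = (idx >= 0) & (idx < len(x))
    seg = np.zeros(M)
    seg[valid] = x[idx[valid]]
    Y = np.fft.fft(seg * h_prime)            # steps 2 and 3: window, then FFT
    k = np.arange(M)
    # step 4: remove the time-varying factor exp(+j w_k (n-M+1))
    X = Y * np.exp(-2j * np.pi * k * (n - M + 1) / M)
    return Y, X

rng = np.random.default_rng(3)   # arbitrary test data
M = 16
x = rng.standard_normal(64)
h = rng.random(M)                # asymmetric window, so that step 1 matters
n = 40
Y, X = stft_via_sliding_dft(x, h, n)
```

With a symmetric window, step 1 leaves the window unchanged, and when only magnitudes are needed the array `Y` can be used directly.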

From  Equation  D.6  it  is  evident  that  a  F/D  interpretation  can  be 
placed  on  the  sliding  DFT  magnitude  as  well  as  on  the  STFT  magnitude. 
Since  it  is  common  practice  to  investigate  the  spectrum  of  a  signal  by 
examining  the  DFT  magnitude  of  a  signal  segment,  the  F/D  interpretation 
can  be  used  to  obtain  insight  into  spectral  behavior  as  a  function  of 
time.  Although  it  is  well  known  that  the  sliding  DFT  can  be  used  to 
implement  a  filter  bank  (Rabiner  and  Gold  [56]),  the  nature  of  the
detection  process  brought  about  by  the  magnitude  operation  has  not  been 
adequately  discussed  in  the  literature.  Therefore,  a  complete  example  is 
given  in  the  remainder  of  this  section. 


For  convenience,  the  Hamming  window  (Oppenheim  and  Schafer  [31]) 
will  be  used  both  as  a  window  function  and  also  for  lowpass  filtering 
purposes.  The  Hamming  window  of  length  M,  normalized  for  unity  gain  in 
the  frequency  domain,  is: 

$$h'(n) = \begin{cases} [0.54 - 0.46\cos(2\pi n/(M-1))]/(0.54M), & 0 \le n \le M-1, \\ 0, & \text{otherwise.} \end{cases} \tag{D.7}$$

The one-sided main lobe bandwidth of a Hamming window of length $M$ is $\omega_h = 4\pi/M$.
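
As a quick numerical check of the normalization, the following Python/NumPy sketch evaluates the window of Equation D.7 for M = 128. The function name `hamming_unity_gain` is an assumption made for illustration. The gain at zero frequency is the sum of the window samples, which is close to (but not exactly) unity, and the window is symmetric about its midpoint.

```python
import numpy as np

def hamming_unity_gain(M):
    """Hamming window of Equation D.7, scaled by 1/(0.54 M)."""
    n = np.arange(M)
    return (0.54 - 0.46 * np.cos(2 * np.pi * n / (M - 1))) / (0.54 * M)

M = 128
h_prime = hamming_unity_gain(M)

# Frequency response at w = 0 is the sum of the window samples.
dc_gain = h_prime.sum()

# Symmetry: h'(n) = h'(M-1-n), so time reversal and delay leave it unchanged.
is_symmetric = np.allclose(h_prime, h_prime[::-1])
```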

As  a  specific  example,  assume  that  a  continuous-time  signal  is 
sampled  at  a  10  kHz  rate  and  a  12.8  millisecond  (128-point)  segment  is
selected  for  analysis.  Let  the  data  segment  be  denoted  by  $x(n)$,  $0 \le n \le 127$.
An  example  sequence  is  shown  in  Fig.  D.l. 

Using  a  128-point  Hamming  window,  the  DFT  magnitude  squared  is 
computed.  The  DFT  is  given  by  Equation  D.l,  and  frequency  spacings  are 
given  by  Equation  D.2,  where  $k = 0, 1, \ldots, 64$.  Since  $x(n)$  is  real,  it  is  not
necessary  to  compute  values  for  $k = 65, \ldots, 127$.  The  DFT  magnitude  squared
for  the  sequence  of  Fig.  D.l  is  shown  in  Fig.  D.2. 

A  bank  of  discrete-time  F/D  subsystems  of  the  type  shown  in  Fig.  2.8 
is  now  implemented,  and  the  output  of  the  F/D  bank  is  sampled  at  a 
specific  time.  Since  the  Hamming  window  is  symmetric,  the  DFT  window 
function $h'(n)$ is the same as the STFT window function $h(n)$. Thus, in Fig. 2.8, $h(n)$ is a 128-point Hamming window, $\omega_c = \omega_k$, and $\theta$ is arbitrarily chosen as zero. Since the lowpass smoothing filter must have the same bandwidth as the bandpass filter, a 63-point Hamming window is used for $h_s(n)$. The smoothing filter, therefore, introduces a 31-sample delay. For convenience, define $x(n)=0$ for $n<0$ and $n>127$. Let the output of each F/D subsystem be denoted by $2v_k(n)$. To approximate the DFT results, $2v_k(158)$ is computed:

$$2v_k(158) = 2\sum_{i=0}^{62}\Big[\sum_{m=0}^{127} x(158-i-m)\,h(m)\cos(\omega_k m)\Big]^2 h_s(i). \tag{D.8}$$

The results are plotted in Fig. D.3, and are comparable to those of Fig. D.2.
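
The comparison between the DFT magnitude squared and the sampled F/D bank output can be sketched in Python with NumPy. The data of Fig. D.1 are not available here, so a single cosine at analysis frequency k = 10 is assumed as a test input. It is also assumed that the 63-point smoothing filter uses the same unity-gain scaling as Equation D.7; the names `hamming_unity_gain`, `hs`, `dft_mag2`, and `fd` are illustrative.

```python
import numpy as np

def hamming_unity_gain(M):
    """Hamming window of Equation D.7 (approximately unity gain at w = 0)."""
    n = np.arange(M)
    return (0.54 - 0.46 * np.cos(2 * np.pi * n / (M - 1))) / (0.54 * M)

M, L = 128, 63
h = hamming_unity_gain(M)     # bandpass prototype, also the DFT window
hs = hamming_unity_gain(L)    # lowpass smoothing filter (assumed scaling)
n = np.arange(M)
k0 = 10                       # assumed test tone at analysis frequency k0
x = np.cos(2 * np.pi * k0 * n / M)

# DFT magnitude squared (Equation D.1) for k = 0..64.
dft_mag2 = np.abs(np.fft.fft(x * h)[: M // 2 + 1]) ** 2

# F/D bank output 2 v_k(158) (Equation D.8) for k = 0..64.
i = np.arange(L)
fd = np.empty(M // 2 + 1)
for k in range(M // 2 + 1):
    g = h * np.cos(2 * np.pi * k * n / M)   # bandpass impulse response
    y = np.convolve(x, g)                    # y(n), n = 0..254; x is zero elsewhere
    fd[k] = 2.0 * np.sum(y[158 - i] ** 2 * hs)
```

For this input both measurements peak in the channel centered on the tone and have similar levels there. They are not identical, since the smoothing filter averages bandpass outputs over times at which the window only partially overlaps the data.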
D.5  AVERAGE  POWER  SPECTRUM  ESTIMATION


In  certain  applications,  reconstruction  of  high  quality  speech 
signals  directly  from  spectral  magnitude  data  is  of  interest.  As 
demonstrated  in  Chapter  4,  a  high  degree  of  time-domain  detail  in  the 
data  is  essential  for  such  applications.  In  many  other  applications, 
however,  exact  signal  reconstruction  is  not  required  and  the  time-domain 
detail  can  be  eliminated  by  averaging  the  spectral  magnitudes. 

For  example,  since  one  noise  sequence  may  sound  the  same  as  many 
others,  retention  of  information  for  exact  signal  reconstruction  is 
unnecessary.  For  data  reduction  purposes  it  is  more  efficient  to 
characterize  the  random  process  which  originally  created  the  data. 
Synthesis  is  then  accomplished  by  generating  a  new  data  sequence  from  a 
random  process  which  has  the  same  characterization  as  the  original  data 
sequence.  Since  a  random  process  is  often  described  in  terms  of  its 
power  spectral  density,  average  spectral  magnitudes  are  useful  for 
estimating  random  process  characteristics.  This  approach  is  generally 
employed  by  the  channel  vocoders  described  in  Section  D.2. 


The  continuous-time  F/D  subsystem  of  Fig.  B.3a,  without  the  square 
root  device,  can  be  used  to  measure  the  average  power  spectrum  of  speech 
(Dunn  and  White  [66]).  The  speech  signal  is  decomposed  into  a  number  of
frequency  bands  by  a  bank  of  bandpass  filters.  The  mean  squared  power  in
each  band  is  computed  by  placing  a  square  law  device  and  smoothing  filter 
at  the  bandpass  filter  outputs.  The  smoothing  filter  time  constant  may 
range  from  125  milliseconds  for  short-time  measurements  to  more  than  a 
minute  for  long-time  analysis.  A  long-time  analysis  can  also  be  obtained 
by  averaging  many  short-time  measurements. 

Digital  techniques  can  also  be  used  for  power  spectrum  estimation. 
For  example,  a  popular  technique  known  as  the  Welch  method  can  be 
described  in  terms  of  a  digital  F/D  bank.  The  Welch  spectrum  estimate,  as 
discussed by Oppenheim and Schafer [31], is computed by sampling the sliding DFT in a manner equivalent to hopping with no overlap. This is done in an attempt to ensure statistical independence of the measurements, and yields an undersampled representation. Hopping with overlap has also been discussed in the literature (Welch [67]), but will not be considered here. Magnitude squared samples are averaged for each frequency, and weighted by a constant which depends on the window function. The Welch spectrum estimate is therefore equivalent to sampling F/D bank outputs and averaging the samples in each channel to determine the power spectral density of the input noise process. The Welch method thus obtains a long-time measurement by averaging short-time measurements.
In precise terms, the Welch spectrum estimate is given by:

$$B_{xx}(e^{j\omega_k}) = \frac{1}{PQ}\sum_{p=1}^{P} |Y_{pM-1}(e^{j\omega_k})|^2 \tag{D.9}$$

where

$$Q = \sum_{m=0}^{M-1} [h'(m)]^2, \tag{D.10}$$

$h'(m)$ is the DFT window function, $Y_n(e^{j\omega_k})$ is the sliding DFT given by Equation D.3, and the data sequence is $x(n)$, $0 \le n \le MP-1$. It can be seen from Equation D.6 that the STFT can also be used to compute the Welch spectrum estimate, as long as the STFT window function is finite length and the window time-reversal and delay are taken into account.
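
Equations D.9 and D.10 can be sketched in Python with NumPy as an average of hopped (non-overlapping) sliding DFT magnitudes squared. The function name `welch_estimate`, the unit-variance white noise input, and the choice of P = 64 segments are assumptions made for illustration. For such an input the estimate should scatter about the true power spectral density of one.

```python
import numpy as np

def welch_estimate(x, h_prime):
    """Welch estimate of Equations D.9 and D.10 (hopping with no overlap)."""
    M = len(h_prime)
    P = len(x) // M                          # number of segments
    Q = np.sum(h_prime ** 2)                 # Equation D.10
    segs = x[: P * M].reshape(P, M)          # row p-1 holds x((p-1)M), ..., x(pM-1)
    Y = np.fft.fft(segs * h_prime, axis=1)   # row p-1 is Y_{pM-1}(e^{j w_k})
    return np.sum(np.abs(Y) ** 2, axis=0) / (P * Q)

rng = np.random.default_rng(1)               # arbitrary seed
M, P = 128, 64
x = rng.standard_normal(M * P)               # assumed test input: white noise
B = welch_estimate(x, np.hamming(M))
```

Dividing by Q removes the dependence of the estimate on the window scaling, so an unnormalized Hamming window can be used as shown.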
D.6  NONUNIFORM  BANDWIDTH  ANALYSIS


Although  a  critical  bandwidth  filter  bank  is  useful  for 
perception-based  speech  analysis,  such  filter  banks  are  not  always 
readily  available  in  the  form  of  existing  electronic  equipment  or 
computer  programs.  The  most  common  type  of  digital  filter  bank  consists 
of  many  narrow  bandpass  filters  which  are  uniformly  spaced  in  frequency, 
all  filters  having  the  same  bandwidth.  These  filter  banks  are  often 
implemented  by  the  sliding  DFT,  since  the  sliding  DFT  can  be  efficiently 
computed  via  the  FFT  algorithm  (see  Section  D.4).  Such  filter  banks  must 
be  modified  for  perception-based  analysis,  allowing  the  filters  to  have  a 
bandwidth  which  varies  with  center  frequency.  Modifications  generally 
involve  combining  the  outputs  of  several  narrowband  filters  in  order  to 
simulate  a  single  filter  of  broader  bandwidth.  Although  such 
modifications  can  be  used  in  dealing  with  Linear  Time-Invariant  (LTI) 
systems,  effects  of  the  nonlinear  detection  process  which  follows  the 
filter  bank  must  also  be  taken  into  account. 

This  section  presents  several  approaches  to  variable  bandwidth 
analysis  which  can  be  implemented  by  modifying  narrowband  filter  banks. 
Unfortunately,  if  these  approaches  achieve  the  desired  result  at  all, 
they  do  not  approach  the  computational  efficiency  of  the  Generalized 
Short-Time  Fourier  Transform  (see  Appendix  C).  Nonetheless,  since  the 
approaches  presented  in  this  section  are  commonly  used  in  practice,  it  is 
worthwhile  to  investigate  the  problems  associated  with  each  method. 


D.6.1  SUMMATION  OF  FILTER/DETECTOR  OUTPUTS 

The  sliding  DFT  is  often  used  to  implement  a  bank  of  many  narrow 
bandpass  filters  (Rabiner  and  Gold  [56]).  The  DFT  magnitude  can  thus  be 
interpreted as a time sample of a narrowband F/D bank output. The narrowband analysis may be broadened as required by adding together two or more F/D outputs, where the filters are adjacent in frequency. Although this approach broadens the steady state sinusoidal response, it will be shown that all outputs have the same form of impulse response. Since it is desirable to have shorter impulse response duration on the high-frequency wide-bandwidth F/D subsystems, as shown in Fig. 2.11, usefulness of this approach is diminished.

To demonstrate, let $\omega_a$ and $\omega_b$ be two analysis frequencies of interest. Define a broadened F/D output as:

$$Z_1(n) = |Y_n(e^{j\omega_a})|^2 + |Y_n(e^{j\omega_b})|^2. \tag{D.11}$$

The two sliding DFT components thus implement a pair of narrowband F/D subsystems which are added together to form a broadened F/D. When the input is an impulse, $x(n)=\delta(n)$, the broadened F/D output is $Z_1(n)=2h^2(n)$. The broadened F/D output thus has the same form of impulse response as either of the two original narrowband F/D subsystems.
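
The impulse response of the summed F/D outputs can be checked numerically with a short Python/NumPy sketch. The helper name `sliding_dft_bin`, the window length M = 32, and the two analysis frequency indices are illustrative assumptions.

```python
import numpy as np

def sliding_dft_bin(x, h_prime, k, n):
    """Y_n(e^{j w_k}) of Equation D.3, with x taken as zero outside its range."""
    M = len(h_prime)
    total = 0j
    for m in range(M):
        idx = n + m - M + 1
        if 0 <= idx < len(x):
            total += x[idx] * h_prime[m] * np.exp(-2j * np.pi * k * m / M)
    return total

M = 32
h = np.hamming(M)            # STFT window h(n)
h_prime = h[::-1]            # DFT window, Equation D.4
x = np.zeros(2 * M)
x[0] = 1.0                   # unit impulse at n = 0
ka, kb = 5, 9                # two arbitrary analysis frequencies

n_axis = np.arange(M)
Z1 = np.array([
    abs(sliding_dft_bin(x, h_prime, ka, n)) ** 2
    + abs(sliding_dft_bin(x, h_prime, kb, n)) ** 2
    for n in n_axis
])
```

The result equals 2h²(n) regardless of which two frequencies are chosen, so the duration of the impulse response is not shortened by the summation.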


Adding together F/D outputs to decrease frequency resolution fails to give a corresponding improvement in time resolution. Important temporal information may be lost due to this "smearing" effect. It can easily be shown that the same result holds whether the F/D subsystems are implemented via the sliding DFT or implemented directly by using individual bandpass filters, memoryless nonlinearities, and lowpass smoothing filters.


D.6.2  SUMMATION  OF  FILTER  OUTPUTS  PRIOR  TO  DETECTION 

Filter  broadening  is  commonly  accomplished  by  adding  together  the 
outputs  of  several  adjacent  (in  frequency)  filters  prior  to  the  detection 
process.  Although  the  desired  filter  broadening  is  achieved,  it  will  be 
shown  that  undesirable  components  may  appear  in  the  impulse  response  of 
the  resulting  F/D  subsystem. 

Consider the impulse response of the directly implemented F/D shown in Fig. D.4a. When $x(n)=\delta(n)$ the output is:

$$Z_2(n) = 2h^2(n)\{1 + \cos[(\omega_a-\omega_b)n]\}. \tag{D.12}$$

Note the presence of a high level beat frequency component.

Beat frequencies are also present when the sliding DFT is modified by adding adjacent complex results. Define a broadened F/D output as:

$$\begin{aligned}
Z_3(n) &= |Y_n(e^{j\omega_a}) + Y_n(e^{j\omega_b})|^2 \\
&= |Y_n(e^{j\omega_a})|^2 + |Y_n(e^{j\omega_b})|^2 + 2\,\mathrm{Re}\{Y_n(e^{j\omega_a})[Y_n(e^{j\omega_b})]^*\},
\end{aligned} \tag{D.13}$$

where the asterisk denotes complex conjugation. The block diagram for this subsystem is shown in Fig. D.4b. It is easily seen that the bandpass filters have the desired broadened characteristics.


To investigate the dynamic characteristics of the subsystem shown in Fig. D.4b, let $x(n)=\delta(n)$. The output then becomes:

$$Z_3(n) = 2h^2(n)\{1 + \cos[(\omega_a-\omega_b)(n-M+1)]\}. \tag{D.14}$$

The impulse response of the new broadened F/D subsystem thus contains an undesirable beat frequency term.

The beat frequency in the sliding DFT becomes more pronounced (i.e., more beat cycles are evident in the F/D impulse response) when $|\omega_a-\omega_b|$ is large. The effect is minimized if two adjacent filters are added. For addition of two adjacent filters, it follows from Equation D.2 that:

$$Z_3(n) = 2h^2(n)\{1 + \cos[2\pi(n+1)/M]\}. \tag{D.15}$$

As a specific example consider a 128-point sliding DFT using a Hamming window; i.e., $M=128$ and

$$h'(n) = \begin{cases} 0.54 - 0.46\cos[2\pi n/(M-1)], & 0 \le n \le M-1, \\ 0, & \text{otherwise.} \end{cases} \tag{D.16}$$

Since the Hamming window is symmetric it follows from Equation D.4 that $h'(n)=h(n)$. The impulse response of an original F/D subsystem, $h^2(n)$, is shown in Fig. D.5a. The impulse response of the broadened F/D, as given
by  Equation  D.15,  is  shown  in  Fig.  D.5b.  In  this  example  of  a  broadened 
F/D  subsystem,  a  single  impulse  input  results  in  two  peaks  at  the  output, 
which  is  generally  an  undesirable  result. 
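
The two-peak impulse response can be reproduced with a minimal Python/NumPy sketch. The helper name `sliding_dft`, the choice of adjacent frequency indices 10 and 11, and the simple peak count are assumptions made for illustration; the window is the 128-point Hamming window of Equation D.16.

```python
import numpy as np

def sliding_dft(x, h_prime, n):
    """Sliding DFT of Equation D.3 at time n for all k (x is zero outside its range)."""
    M = len(h_prime)
    idx = np.arange(n - M + 1, n + 1)
    valid = (idx >= 0) & (idx < len(x))
    seg = np.zeros(M)
    seg[valid] = x[idx[valid]]
    return np.fft.fft(seg * h_prime)

M = 128
h = np.hamming(M)              # Equation D.16; symmetric, so h'(n) = h(n)
x = np.zeros(2 * M)
x[0] = 1.0                     # unit impulse at n = 0
ka, kb = 11, 10                # two adjacent analysis frequencies

n_axis = np.arange(M)
Z3 = np.empty(M)
for n in n_axis:
    Y = sliding_dft(x, h, n)
    Z3[n] = abs(Y[ka] + Y[kb]) ** 2        # Equation D.13

# Closed form of Equation D.15 for comparison.
Z3_formula = 2 * h ** 2 * (1 + np.cos(2 * np.pi * (n_axis + 1) / M))

# Interior local maxima of the impulse response.
peaks = [n for n in range(1, M - 1) if Z3[n] > Z3[n - 1] and Z3[n] > Z3[n + 1]]
```

The beat term passes through zero near the middle of the window, where h²(n) is largest, which splits the response into two lobes.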


As  noted  by  Rabiner  and  Gold  [56],  the  equivalence  between 
multiplication  in  the  time  domain  and  convolution  in  the  frequency  domain 
implies  that  windowing  can  be  accomplished  by  a  complex  weighted 
summation  of  many  adjacent  (in  frequency)  values  of  the  sliding  DFT.  A 
carefully  chosen  combination  of  weights  can  be  used  to  modify  the 
original  analysis  window,  and  can  reduce  or  eliminate  beat  frequency 
effects.  Thus,  although  it  is  possible  to  broaden  filters  by  this 
approach,  computational  efficiency  is  sacrificed.
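
The equivalence between time-domain windowing and a weighted summation of adjacent DFT values can be illustrated with a well-known special case. The sketch below, in Python with NumPy, assumes a Hamming window with period M (denominator M rather than the M-1 used in Equation D.16), for which exactly three adjacent DFT values are needed. The variable names and the random test data are illustrative, and this is not the particular combination of weights discussed by Rabiner and Gold.

```python
import numpy as np

M = 64
rng = np.random.default_rng(2)       # arbitrary test data
x = rng.standard_normal(M)
n = np.arange(M)

# Periodic Hamming window (assumed variant with denominator M).
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / M)

# Windowing in the time domain, followed by a DFT.
Xw_time = np.fft.fft(x * w)

# The same result from the rectangular-window DFT X(k):
# 0.54 X(k) - 0.23 [X(k-1) + X(k+1)], with indices taken modulo M.
X = np.fft.fft(x)
Xw_freq = 0.54 * X - 0.23 * (np.roll(X, 1) + np.roll(X, -1))
```

The weights are real here because the window is referenced to n = 0. A window with a different time alignment, or one designed to suppress beat terms, would in general require complex weights and more terms, which is where the computational cost arises.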


D.6.3  SUMMATION  OF  STFT  COMPONENTS  PRIOR  TO  MAGNITUDE 

STFT results are lowpass functions, and are unlike the bandpass results produced by the sliding DFT. Thus, no beat frequencies will occur when adjacent (in frequency) complex STFT results are added and the magnitude squared is computed. Let the broadened STFT analysis be given by:

$$Z_4(n) = |X_n(e^{j\omega_a}) + X_n(e^{j\omega_b})|^2, \tag{D.17}$$

where $\omega_a$ and $\omega_b$ are two STFT frequencies of interest. The subsystem block diagram is shown in Fig. D.6. When the input is an impulse at time $m$, $x(n)=\delta(n-m)$, the output is:

$$Z_4(n) = 2\{1 + \cos[(\omega_a-\omega_b)m]\}\,h^2(n-m). \tag{D.18}$$

Thus the subsystem has a time-varying impulse response, which is clearly an undesirable result.
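
The time-varying behavior can be demonstrated with a short Python/NumPy sketch. Equation 2.21 is not reproduced in this appendix, so the conventional STFT form $X_n(e^{j\omega_k}) = \sum_l x(l)\,h(n-l)\,e^{-j\omega_k l}$ is assumed. The function names, the window length M = 32, and the two impulse times are illustrative.

```python
import numpy as np

def stft_bin(x, h, k, n, M):
    """Assumed STFT form: X_n(e^{j w_k}) = sum_l x(l) h(n-l) exp(-j w_k l)."""
    total = 0j
    for l in range(len(x)):
        if 0 <= n - l < len(h):
            total += x[l] * h[n - l] * np.exp(-2j * np.pi * k * l / M)
    return total

M = 32
h = np.hamming(M)
ka, kb = 6, 5                  # two adjacent STFT frequencies

def z4_response(m0):
    """Z4(n) of Equation D.17 for n = m0..m0+M-1, with an impulse at time m0."""
    x = np.zeros(3 * M)
    x[m0] = 1.0
    return np.array([
        abs(stft_bin(x, h, ka, n, M) + stft_bin(x, h, kb, n, M)) ** 2
        for n in range(m0, m0 + M)
    ])

r0 = z4_response(0)            # impulse at m = 0
r16 = z4_response(M // 2)      # impulse at m = M/2
```

An impulse at m = 0 produces the response 4h²(n), while an identical impulse at m = M/2 produces no output at all, since the factor 1 + cos[(ω_a - ω_b)m] in Equation D.18 is then zero.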


Figure D.6: Summation of STFT Components Prior to Magnitude

D.7  CONCLUSION 


In  this  appendix,  the  new  relationship  between  STFT  magnitude 
squared  and  F/D  subsystems  was  used  to  describe  the  characteristics  of 
several  speech  analysis  and  synthesis  systems.  The  relationship  provides 
a  common  basis  for  understanding  the  operation  of  many  systems,  and  can 
be  used  to  indicate  similarities  and  differences  between  various  systems. 